Preface


Like the first edition, the second edition of the Handbook of Parametric and Nonparametric 
Statistical Procedures is designed to provide researchers, teachers, and students with a compre- 
hensive reference book in the areas of parametric and nonparametric statistics. The addition of 
a large amount of new material (250 pages) makes the Handbook unparalleled in terms of its 
coverage of material in the field of statistics. Rather than being directed at a limited audience, 
the Handbook is intended for individuals who are involved in a broad spectrum of academic 
disciplines encompassing the fields of mathematics/statistics, the social and biological sciences, 
business, and education. My philosophy in writing both this and the previous edition was to 
create a reference book on parametric and nonparametric statistical procedures that I (as well as 
colleagues and students I have spoken with over the years) have always wanted, yet could never 
find. To be more specific, my primary goal was to produce a comprehensive reference book on 
univariate and bivariate statistical procedures which covers a scope of material that extends far 
beyond that which is covered in any single available source. It was essential that the book be 
applications oriented, yet at the same time that it address relevant theoretical and practical issues 
which are of concern to the sophisticated researcher. In addition, I wanted to write a book that 
is accessible to people who have a limited knowledge of statistics, as well as those who are well 
versed in the subject. I believe I have achieved these goals, and on the basis of this I believe that 
the Handbook of Parametric and Nonparametric Statistical Procedures will continue to 
serve as an invaluable resource for people in multiple academic disciplines who conduct research, 
are involved in teaching, or are presently in the process of learning statistics. 

I am not aware of any applications-oriented book that provides in-depth coverage of as 
many statistical procedures as the number that are covered in the Handbook of Parametric and 
Nonparametric Statistical Procedures. Inspection of the Table of Contents and Index should 
confirm the scope of material covered in the book. A unique feature of the Handbook, which 
distinguishes it from other reference books on statistics, is that it provides the reader with a 
practical guide that emphasizes application over theory. Although the book will be of practical 
value to statistically sophisticated individuals who are involved in research, it is also accessible 
to those who lack the theoretical and/or mathematical background required for understanding the 
material documented in more conventional statistics reference books. Since a major goal of 
the book is to serve as a practical guide, emphasis is placed on decision making with respect 
to which test is most appropriate to employ in evaluating a specific design. Within the frame- 
work of being user-friendly, clear computational guidelines, accompanied by easy-to-understand 
examples, are provided for all procedures. 

One should not, however, get the impression that the Handbook of Parametric and Non- 
parametric Statistical Procedures is little more than a cookbook. In point of fact, the design 
of the Handbook is such that within the framework of each of the statistical procedures which 
are covered, in addition to the basic guidelines for decision making and computation, substantial 
in-depth discussion is devoted to a broad spectrum of practical and theoretical issues, many of 
which are not discussed in conventional statistics books. Inclusion of the latter material ensures 
that the Handbook will serve as an invaluable resource for those who are sophisticated as well 
as unsophisticated in statistics. 

In order to facilitate its usage, most of the procedures contained in the Handbook are 
organized within a standardized format. Specifically, for most of the procedures the following 
information is provided: 

I. Hypothesis evaluated with test and relevant background information The first part
of this section provides a general statement of the hypothesis evaluated with the test. This is 
followed by relevant background information on the test such as the following: a) Information 
regarding the experimental design for which the test is appropriate; b) Any assumptions under- 
lying the test which, if violated, would compromise its reliability; and c) General information on 
other statistical procedures that are related to the test. 

II. Example This section presents a description of an experiment, with an accompanying 
data set (or in some instances two experiments utilizing the same data set), for which the test will 
be employed. All examples employ small sample sizes, as well as integer data consisting of 
small numbers, in order to facilitate the reader's ability to follow the computational procedures 
to be described in Section IV. 

III. Null versus alternative hypotheses This section contains both a symbolic and verbal
description of the statistical hypotheses evaluated with the test (i.e., the null hypothesis versus
the alternative hypothesis). It also states the form the data will assume when the null hypothesis
is supported, as opposed to when one or more of the possible alternative hypotheses are
supported.

IV. Test computations This section contains a step-by-step description of the procedure 
for computing the test statistic. The computational guidelines are clearly outlined in reference 
to the data for the example(s) presented in Section II. 

V. Interpretation of the test results This section describes the protocol for evaluating the 
computed test statistic. Specifically: a) It provides clear guidelines for employing the appropriate 
table of critical values to analyze the test statistic; b) Guidelines are provided delineating the 
relationship between the tabled critical values and when a researcher should retain the null 
hypothesis, as opposed to when the researcher can conclude that one or more of the possible 
alternative hypotheses are supported; c) The computed test statistic is interpreted in reference to the 
example(s) presented in Section II; and d) In instances where a parametric and nonparametric test 
can be used to evaluate the same set of data, the results obtained using both procedures are 
compared with one another, and the relative power of both tests is discussed in this section and/or 
in Section VI. 

VI. Additional analytical procedures for the test and/or related tests Since many of the 
tests described in the Handbook have additional analytical procedures associated with them, such 
procedures are described in this section. Many of these procedures are commonly employed (such 
as comparisons conducted following an analysis of variance), while others are used and/or 
discussed less frequently (such as the tie correction employed for the large sample normal 
approximation of many nonparametric test statistics). Many of the analytical procedures covered 
in Section VI are not discussed (or if so, only discussed briefly) in other books. Some repre- 
sentative topics which are covered in Section VI are planned versus unplanned comparison 
procedures, measures of association for inferential statistical tests, computation of confidence 
intervals, and computation of power. In addition to the aforementioned material, for many of the 
tests there is additional discussion of other statistical procedures that are directly related to the test 
under discussion. In instances where two or more tests produce equivalent results, examples are 
provided which clearly demonstrate the equivalency of the procedures. 

VII. Additional discussion of the test Section VII discusses theoretical concepts and 
issues, as well as practical and procedural issues that are relevant to a specific test. In some 
instances where a subject is accorded brief coverage in the initial material presented on the test, 
the reader is alerted to the fact that the subject is discussed in greater depth in Section VII. Many 
of the issues discussed in this section are topics that are generally not covered in other books, or if 
they are, they are only discussed briefly. Among the topics covered in Section VII is additional 
discussion of the relationship between a specific test and other tests that are related to it. Section 
VII also provides bibliographic information on less commonly employed alternative procedures
that can be used to evaluate the same design for which the test under discussion is used. 

VIII. Additional examples illustrating the use of the test This section provides 
descriptions of one or more additional experiments for which a specific test is applicable. For the 
most part, these examples employ the same data set as that in the original example(s) presented 
in Section II for that test. By virtue of using standardized data for most of the examples, the 
material for a test contained in Section IV (Test computations) and Section V (Interpretation 
of the test results) will be applicable to most of the additional examples. Because of this, the 
reader is able to focus on common design elements in various experiments which indicate that a 
given test is appropriate for use with a specific type of design. 

IX. Addendum At the conclusion of the discussion of a number of tests an Addendum has 
been included that describes one or more related tests that are not discussed in Section VI. As an 
example, the Addendum of the between-subjects factorial analysis of variance contains an 
overview and computational guidelines for the factorial analysis of variance for a mixed design 
and the within-subjects factorial analysis of variance. 

References This section provides the reader with a listing of primary and secondary source 
material on each test. 

Endnotes At the conclusion of most tests, a detailed endnotes section contains additional 
useful information that further clarifies or expands upon material discussed in the main text. 


The first edition of the Handbook of Parametric and Nonparametric Statistical Procedures
consisted of an Introduction followed by 26 chapters, each of which documented
a specific inferential statistical test (as well as related tests) or measure of correlation/association.
The general label Test was used (and is used in this edition) for all procedures described in the
book (i.e., inferential tests as well as measures of correlation/association). In addition to the
Introduction, the second edition of the Handbook contains 32 chapters. A chapter describing in
detail each of the following six tests has been added to the second edition: a) The single-sample test
for evaluating population skewness (Test 4); b) The single-sample test for evaluating
population kurtosis (Test 5) (The D'Agostino-Pearson test of normality (Test 5a) is also
described in this chapter); c) The Kolmogorov-Smirnov goodness-of-fit test for a single sample
(Test 7) (The Lilliefors test for normality (Test 7a) is also described in this chapter); d) The
Kolmogorov-Smirnov test for two independent samples (Test 13); e) The Moses test for
equal variability (Test 15); and f) The van der Waerden normal-scores test for k independent
samples (Test 23). In addition to the aforementioned tests, a substantial amount of new material
has been added to tests that were included in the first edition. Chapters/Tests included in the first
edition are noted below, indicating subject matter that has been added to the second edition.

Introduction: Description and computation of the coefficient of variation; extensive
coverage of skewness and kurtosis, including description and computation of the Pearsonian
coefficient of skewness, the g1 and √b1 measures of skewness, and the g2 and b2 measures of
kurtosis.

The chi-square goodness-of-fit test (Test 8): Illustration of the use of the chi-square 
goodness-of-fit test for assessing goodness-of-fit for a normal population distribution; discussion 
of Cohen's w index for computing the power of the chi-square goodness-of-fit test; description 
of heterogeneity chi-square analysis. Two additional examples have been added to this chapter 
to illustrate the new material. 

The binomial sign test for a single sample (Test 9): Discussion of Cohen's g index 
for computing the power of the binomial sign test for a single sample; evaluating goodness-of-fit 
for a binomial distribution. An Addendum has been added that provides comprehensive 
coverage of the following discrete probability distributions: multinomial distribution; negative 
binomial distribution; hypergeometric distribution; Poisson distribution; and matching
distribution. Twelve additional examples have been added to this chapter to illustrate the new 
material. 

The single-sample runs test (and other tests of randomness) (Test 10): The extension 
of the runs test to data with more than two categories is described. The runs test for serial 
randomness (Test 10a) has been added to this chapter. There is additional discussion of the 
concept of randomness. An Addendum has been added that describes in detail the generation of 
pseudorandom numbers — specifically, the following methods are described: the midsquare 
method, the midproduct method, and the linear congruential method. The Addendum also 
provides detailed coverage of the following alternative tests of randomness: The frequency test 
(Test 10b), The gap test (Test 10c), The poker test (Test 10d), The maximum test (Test 10e), 
and The mean square successive difference test (Test 10f). One additional example has been 
added to this chapter to illustrate the new material. In addition, a standardized data set is evaluated 
with four of the aforementioned tests of randomness. 
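
For readers unfamiliar with two of the generators named above, the following is a minimal sketch of the midsquare method and the linear congruential method in their usual textbook forms. The function names, the four-digit width, and the constants in the usage note are illustrative assumptions and are not taken from the Handbook.

```python
def midsquare(seed, count, digits=4):
    """Midsquare method: square the current value, zero-pad the square to
    2 * digits characters, and keep the middle `digits` characters.
    Assumes the seed has at most `digits` digits (illustrative choice)."""
    values = []
    x = seed
    for _ in range(count):
        squared = str(x * x).zfill(2 * digits)
        start = digits // 2
        x = int(squared[start:start + digits])
        values.append(x)
    return values


def linear_congruential(seed, count, a, c, m):
    """Linear congruential method: x_next = (a * x + c) mod m.
    The multiplier a, increment c, and modulus m are supplied by the caller."""
    values = []
    x = seed
    for _ in range(count):
        x = (a * x + c) % m
        values.append(x)
    return values
```

With the hypothetical inputs midsquare(1234, 3) and linear_congruential(7, 5, a=5, c=3, m=16), the functions return [5227, 3215, 3362] and [6, 1, 8, 11, 10] respectively. Sequences of this kind are what tests of randomness are designed to evaluate.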

The t test for two independent samples (Test 11): Comprehensive discussion of outliers
(including Test 11e: Procedures for identifying outliers), robust statistical procedures, and
data transformation (description of and computational examples illustrating the square root,
logarithmic, reciprocal, and arcsine transformations); discussion of Hotelling's T².

The Mann-Whitney U test (Test 12): An Addendum has been added that provides 
comprehensive coverage of computer-intensive/data-driven statistical procedures/resampling 
statistics. The following topics are discussed in the Addendum: Randomization and permu- 
tation tests (including The randomization test for two independent samples (Test 12a), The 
bootstrap (Test 12b), and The jackknife (Test 12c)). Two additional examples have been added 
to this chapter to illustrate the new material. 

The chi-square test for r x c tables (Test 16): Discussion of Cohen’s w and h indices for 
computing the power of the chi-square test for r x c tables and the z test for two independent 
proportions (Test 16d); heterogeneity chi-square analysis for 2 x 2 contingency tables;
expanded coverage of the odds ratio (Test 16j) (including discussion of the concept of relative 
risk, test of significance for an odds ratio (Test 16j-a), and computation of a confidence interval 
for an odds ratio); discussion of Simpson's Paradox; analysis of multidimensional contingency 
tables. Three additional examples have been added to this chapter to illustrate the new material. 

The McNemar test (Test 20): An Addendum has been added that describes The Bowker
test of symmetry (Test 20a). One additional example has been added to this chapter to illustrate 
the new material. 

The single-factor between-subjects analysis of variance (Test 21): Discussion of Cohen's
f index employed in computing the power and magnitude of treatment effect for the single-factor
between-subjects analysis of variance; discussion of multivariate analysis of variance
(MANOVA). An Addendum has been added that provides comprehensive coverage of the
single-factor between-subjects analysis of covariance (Test 21j). One additional example has been
added to this chapter to illustrate the new material.

The Kruskal-Wallis one-way analysis of variance by ranks (Test 22): Discussion of an 
alternative pairwise multiple comparison procedure. 

The single-factor within-subjects analysis of variance (Test 24): Revised equations for
computing the omega squared statistic for magnitude of treatment effect; discussion of Cohen's
f index employed in computing the power and magnitude of treatment effect for the
single-factor within-subjects analysis of variance; discussion of the Latin square design.

The between-subjects factorial analysis of variance (Test 27): Revised
equations for computing the omega squared statistic for magnitude of treatment effect; discussion
of Cohen’s f index employed in computing the power and magnitude of treatment effect for the
between-subjects factorial analysis of variance.

The Pearson product-moment correlation coefficient (Test 28): The following material 
has been added to the Addendum: Nonmathematical descriptions of the following multivariate 
procedures: Factor analysis, canonical correlation, discriminant analysis, and logistic re- 
gression; meta-analysis and related topics. (This section contains a comprehensive discussion 
of meta-analysis and includes a description of the following meta-analytic procedures: Test 28n: 
Procedure for comparing k studies with respect to significance level; Test 28o: The Stouffer
procedure for obtaining a combined significance level for k studies; Test 28p: Procedure for 
comparing k studies with respect to effect size; Test 28q: Procedure for obtaining a 
combined effect size for k studies. This section also discusses Jacob Cohen's indices for the 
power computation of various tests, and the controversy over the conventional significance test 
based hypothesis testing model versus the minimum-effect hypothesis testing model.) One 
additional example has been added to this chapter to illustrate the new material. 
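
As a rough illustration of what a combined significance level for k studies involves, the sketch below implements the unweighted Stouffer procedure in its commonly cited form: each one-tailed p value is converted to a standard normal deviate, the deviates are summed, and the sum is divided by the square root of k. The function name and the example values are assumptions made for illustration, and the Handbook's own computational guidelines may differ in detail.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_combined_p(one_tailed_p_values):
    """Unweighted Stouffer combination of one-tailed p values from k
    independent studies. Each p value must lie strictly between 0 and 1."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Convert each one-tailed p value to the z value that cuts off that
    # proportion in the upper tail of the standard normal distribution.
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in one_tailed_p_values]
    k = len(z_scores)
    z_combined = sum(z_scores) / sqrt(k)
    # Upper-tail probability associated with the combined z value.
    p_combined = 1.0 - std_normal.cdf(z_combined)
    return z_combined, p_combined
```

For two hypothetical studies that each report a one-tailed p value of .05, the combined z value is approximately 2.33 and the combined one-tailed p value is approximately .01.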

In addition to the aforementioned topics, the second edition provides expanded information 
on the asymptotic relative efficiency of nonparametric statistical procedures. All in all, 25 new
tests have been added to the second edition along with 32 additional examples to illustrate the new 
material. 


Although it is not a prerequisite, the Handbook of Parametric and Nonparametric 
Statistical Procedures is designed to be used by those who have a basic familiarity with de- 
scriptive statistics and experimental design. Prior familiarity with the latter subject matter will 
facilitate one's ability to use the book efficiently. In order to ensure that the reader has familiarity
with these topics, an Introduction has been included which provides a general overview of 
descriptive statistics and experimental design. Following the Introduction, the reader is provided 
with guidelines and decision tables for selecting the appropriate statistical test for evaluating a 
specific experimental design. The Handbook of Parametric and Nonparametric Statistical 
Procedures can be used as a reference book or it can be employed as a textbook in undergraduate 
and graduate courses that are designed to cover a broad spectrum of parametric and/or non- 
parametric statistical procedures. 

The author would like to express his gratitude to a number of people who helped make this 
book a reality. First, I would like to thank Tim Pletscher of CRC Press for his confidence in and 
support of the first edition of the Handbook. Special thanks are due to Bob Stern, the mathematics 
editor at CRC Press, who suggested a second edition. Without his efforts and encouragement this 
book would not have become a reality. Sylvia Wood of CRC Press deserves thanks for overseeing 
the production of the final product. I am also indebted to Glena Ames who did an excellent job 
preparing the copy-ready manuscript. Finally, I must express my appreciation to my wife Vicki 
and daughter Emily, who both endured and tolerated the difficulties associated with a project of 
this magnitude. 


David Sheskin 


1. Unequal sample sizes
2. Robustness of the t test for two independent samples
3. Outliers (Test 11e: Procedures for identifying outliers) and
data transformation
4. Hotelling's T²
VIII. Additional Examples Illustrating the Use of the t Test for Two
Independent Samples


Test 12. The Mann-Whitney U Test


I. Hypothesis Evaluated with Test and Relevant Background Information
II. Example
III. Null versus Alternative Hypotheses
IV. Test Computations
V. Interpretation of the Test Results
VI. Additional Analytical Procedures for the Mann-Whitney U Test and/or
Related Tests
1. The normal approximation of the Mann-Whitney U statistic
for large sample sizes
2. The correction for continuity for the normal approximation
of the Mann-Whitney U test
3. Tie correction for the normal approximation of the
Mann-Whitney U statistic
4. Sources for computing a confidence interval for the
Mann-Whitney U test
VII. Additional Discussion of the Mann-Whitney U Test
1. Power-efficiency of the Mann-Whitney U test
2. Equivalency of the normal approximation of the Mann-Whitney
U test and the t test for two independent samples with rank-orders
3. Alternative nonparametric rank-order procedures for evaluating a design
involving two independent samples
VIII. Additional Examples Illustrating the Use of the Mann-Whitney U Test
IX. Addendum
1. Computer-intensive tests (Randomization and permutation
tests; Test 12a: The randomization test for two
independent samples; Test 12b: The bootstrap; Test 12c:
The jackknife; Final comments on computer-intensive
procedures)


Test 13. The Kolmogorov-Smirnov Test for Two Independent Samples


I. Hypothesis Evaluated with Test and Relevant Background Information
II. Example
III. Null versus Alternative Hypotheses
IV. Test Computations
V. Interpretation of the Test Results
VI. Additional Analytical Procedures for the Kolmogorov-Smirnov test for
two independent samples
1. Graphical method for computing the Kolmogorov-Smirnov
test statistic
2. Computing sample confidence intervals for the
Kolmogorov-Smirnov test for two independent samples
3. Large sample chi-square approximation for a one-tailed
analysis for the Kolmogorov-Smirnov test for two
independent samples
VII. Additional Discussion of the Kolmogorov-Smirnov Test for Two 
Independent Samples 
1. Additional comments on the Kolmogorov-Smirnov test for two 
independent samples 
VIII. Additional Examples Illustrating the Use of the Kolmogorov-Smirnov 
Test for Two Independent Samples 


Test 14. The Siegel-Tukey Test for Equal Variability


I. Hypothesis Evaluated with Test and Relevant Background Information 
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results 
VI. Additional Analytical Procedures for the Siegel-Tukey Test for Equal
Variability and/or Related Tests 

1. The normal approximation of the Siegel-Tukey test statistic 
for large sample sizes 

2. The correction for continuity for the normal approximation 
of the Siegel-Tukey test for equal variability 

3. Tie correction for the normal approximation of the Siegel- 
Tukey test statistic 

4. Adjustment of scores for the Siegel-Tukey test for equal
variability when θ1 ≠ θ2

VII. Additional Discussion of the Siegel-Tukey Test for Equal Variability

1. Analysis of the homogeneity of variance hypothesis for the 
same set of data with both a parametric and nonparametric 
test, and the power-efficiency of the Siegel-Tukey Test for 
Equal Variability 

2. Alternative nonparametric tests of dispersion 

VIII. Additional Examples Illustrating the Use of the Siegel-Tukey Test for
Equal Variability 


Test 15. The Moses Test for Equal Variability


I. Hypothesis Evaluated with Test and Relevant Background Information
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results 
VI. Additional Analytical Procedures for the Moses Test for Equal 
Variability and/or Related Tests 
1. The normal approximation of the Moses test statistic for large 
sample sizes 
VII. Additional Discussion of the Moses Test for Equal Variability 
1. Power-efficiency of the Moses Test for equal variability
2. Issue of repetitive resampling 
3. Alternative nonparametric tests of dispersion 
VIII. Additional Examples Illustrating the Use of the Moses Test for Equal 
Variability 


Test 16. The Chi-Square Test for r x c Tables [Test 16a: The Chi-Square 
Test for Homogeneity; Test 16b: The Chi-Square Test of 
Independence (employed with a single sample)] 


I. Hypothesis Evaluated with Test and Relevant Background Information
II. Examples 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results 
VI. Additional Analytical Procedures for the Chi-Square Test for r x c 
Tables and/or Related Tests 
1. Yates’ correction for continuity 
2. Quick computational equation for a 2 x 2 table 
3. Evaluation of a directional alternative hypothesis in the case 
of a 2 x 2 contingency table 
4. Test 16c: The Fisher exact test 
5. Test 16d: The z test for two independent proportions 
6. Computation of a confidence interval for a difference 
between proportions 
7. Test 16e: The median test for independent samples 
8. Extension of the chi-square test for r x c tables to 
contingency tables involving more than two rows and/or 
columns, and associated comparison procedures 
9. The analysis of standardized residuals 
10. Sources for computing the power of the chi-square test for 
r x c tables 
11. Heterogeneity chi-square analysis for a 2 x 2 contingency 
table 
12. Measures of association for r x c contingency tables (Test 
16f: The contingency coefficient; Test 16g: The phi 
coefficient; Test 16h: Cramér's phi coefficient; Test 16i: 
Yule's Q; Test 16j: The odds ratio (and the concept of 
relative risk); Test 16j-a: Test of significance for an odds
ratio and computation of a confidence interval for an odds 
ratio) 
VII. Additional Discussion of the Chi-Square Test for r x c Tables 
1. Simpson's Paradox 
2. Analysis of multidimensional contingency tables 
VIII. Additional Examples Illustrating the Use of the Chi-Square Test for 
r x c Tables 


Inferential Statistical Tests Employed with Two Dependent
Samples (and Related Measures of Association/Correlation) 


Test 17. The t Test for Two Dependent Samples


I. Hypothesis Evaluated with Test and Relevant Background Information
II. Example
III. Null versus Alternative Hypotheses
IV. Test Computations
V. Interpretation of the Test Results
VI. Additional Analytical Procedures for the t Test for Two Dependent
Samples and/or Related Tests
1. Alternative equation for the t test for two dependent samples
2. The equation for the t test for two dependent samples when
a value for a difference other than zero is stated in the null
hypothesis
3. Test 17a: The t test for homogeneity of variance for two dependent
samples: Evaluation of the homogeneity of variance assumption of
the t test for two dependent samples
4. Computation of the power of the t test for two dependent samples and
the application of Test 17b: Cohen's d index
5. Measure of magnitude of treatment effect for the t test for two
dependent samples: Omega squared (Test 17c)
6. Computation of a confidence interval for the t test for two dependent
samples
7. Test 17d: Sandler's A test
8. Test 17e: The z test for two dependent samples
VII. Additional Discussion of the t Test for Two Dependent Samples
1. The use of matched subjects in a dependent samples design
2. Relative power of the t test for two dependent samples and
the t test for two independent samples
3. Counterbalancing and order effects
4. Analysis of a before-after design with the t test for two
dependent samples
VIII. Additional Example Illustrating the Use of the t Test for Two
Dependent Samples


Test 18. The Wilcoxon Matched-Pairs Signed-Ranks Test


I. Hypothesis Evaluated with Test and Relevant Background Information
II. Example
III. Null versus Alternative Hypotheses
IV. Test Computations
V. Interpretation of the Test Results
VI. Additional Analytical Procedures for the Wilcoxon Matched-Pairs
Signed-Ranks Test and/or Related Tests
1. The normal approximation of the Wilcoxon T statistic for large
sample sizes
2. The correction for continuity for the normal approximation
of the Wilcoxon matched-pairs signed-ranks test
3. Tie correction for the normal approximation of the Wilcoxon
test statistic
4. Sources for computing a confidence interval for the
Wilcoxon matched-pairs signed-ranks test
VII. Additional Discussion of the Wilcoxon Matched-Pairs Signed-Ranks
Test
1. Power-efficiency of the Wilcoxon matched-pairs
signed-ranks test
2. Alternative nonparametric procedures for evaluating a design
involving two dependent samples
VIII. Additional Examples Illustrating the Use of the Wilcoxon
Matched-Pairs Signed-Ranks Test


Test 19. The Binomial Sign Test for Two Dependent Samples


I. Hypothesis Evaluated with Test and Relevant Background Information
II. Example
III. Null versus Alternative Hypotheses
IV. Test Computations
V. Interpretation of the Test Results
VI. Additional Analytical Procedures for the Binomial Sign Test for Two
Dependent Samples and/or Related Tests
1. The normal approximation of the binomial sign test for two
dependent samples with and without a correction for
continuity
2. Computation of a confidence interval for the binomial sign
test for two dependent samples
3. Sources for computing the power of the binomial sign test for
two dependent samples, and comments on asymptotic relative
efficiency of the test
VII. Additional Discussion of the Binomial Sign Test for Two Dependent
Samples
1. The problem of an excessive number of zero difference
scores
2. Equivalency of the Friedman two-way analysis of variance
by ranks and the binomial sign test for two dependent samples
when k = 2
VIII. Additional Examples Illustrating the Use of the Binomial Sign Test for
Two Dependent Samples


Test 20. The McNemar Test


I. Hypothesis Evaluated with Test and Relevant Background Information
II. Examples
III. Null versus Alternative Hypotheses
IV. Test Computations
V. Interpretation of the Test Results
VI. Additional Analytical Procedures for the McNemar Test and/or Related
Tests
1. Alternative equation for the McNemar test statistic based on
the normal distribution
2. The correction for continuity for the McNemar test
3. Computation of the exact binomial probability for the
McNemar test model with a small sample size
4. Additional analytical procedures for the McNemar test
VII. Additional Discussion of the McNemar Test
1. Alternative format for the McNemar test summary table and
modified test equation
2. Alternative nonparametric procedures for evaluating a design
with two dependent samples involving categorical data
VIII. Additional Examples Illustrating the Use of the McNemar Test
IX. Addendum
1. Extension of the McNemar test model beyond 2 x 2
contingency tables (Test 20a: The Bowker test of
symmetry)


Inferential Statistical Tests Employed with Two or More 
Independent Samples (and Related Measures of 
Association/Correlation) 


Test 21. The Single-Factor Between-Subjects Analysis of Variance 


I. Hypothesis Evaluated with Test and Relevant Background Information 
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results 
VI. Additional Analytical Procedures for the Single-Factor Between-Subjects 
Analysis of Variance and/or Related Tests 
1. Comparisons following computation of the omnibus F value 
for the single-factor between-subjects analysis of variance 
(Planned versus unplanned comparisons; Simple versus 
complex comparisons; Linear contrasts; Orthogonal 
comparisons; Test 21a: Multiple t tests/Fisher's LSD 
test; Test 21b: The Bonferroni-Dunn test; Test 21c: 
Tukey's HSD test; Test 21d: The Newman-Keuls test; 
Test 21e: The Scheffé test; Test 21f: The Dunnett 
test; Additional discussion of comparison procedures and final 
recommendations; The computation of a confidence interval 
for a comparison) 
2. Comparing the means of three or more groups when k > 4 
3. Evaluation of the homogeneity of variance assumption of the 
single-factor between-subjects analysis of variance 
4. Computation of the power of the single-factor between-subjects 
analysis of variance 
5. Measures of magnitude of treatment effect for the single-factor 
between-subjects analysis of variance: Omega 
squared (Test 21g), Eta squared (Test 21h), and Cohen's 
f index (Test 21i) 
6. Computation of a confidence interval for the mean of a 
treatment population 
VII. Additional Discussion of the Single-Factor Between-Subjects Analysis 
of Variance 
1. Theoretical rationale underlying the single-factor between-subjects 
analysis of variance 
2. Definitional equations for the single-factor between-subjects 
analysis of variance 
3. Equivalency of the single-factor between-subjects analysis of 
variance and the t test for two independent samples when 
k = 2 
4. Robustness of the single-factor between-subjects analysis of 
variance 
5. Fixed-effects versus random-effects models for the single-factor 
between-subjects analysis of variance 
6. Multivariate analysis of variance (MANOVA) 
VIII. Additional Examples Illustrating the Use of the Single-Factor Between-Subjects 
Analysis of Variance 
IX. Addendum 
1. Test 21j: The Single-Factor Between-Subjects Analysis of 
Covariance 


Test 22. The Kruskal-Wallis One-Way Analysis of Variance by Ranks 


I. Hypothesis Evaluated with Test and Relevant Background Information 
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results 
VI. Additional Analytical Procedures for the Kruskal-Wallis One-Way 
Analysis of Variance by Ranks and/or Related Tests 
1. Tie correction for the Kruskal-Wallis one-way analysis of 
variance by ranks 
2. Pairwise comparisons following computation of the test 
statistic for the Kruskal-Wallis one-way analysis of variance 
by ranks 
VII. Additional Discussion of the Kruskal-Wallis One-Way Analysis of 
Variance by Ranks 
1. Exact tables of the Kruskal-Wallis distribution 
2. Equivalency of the Kruskal-Wallis one-way analysis of 
variance by ranks and the Mann-Whitney U test when k = 2 
3. Power-efficiency of the Kruskal-Wallis one-way analysis of 
variance by ranks 
4. Alternative nonparametric rank-order procedures for 
evaluating a design involving k independent samples 
VIII. Additional Examples Illustrating the Use of the Kruskal-Wallis One-Way 
Analysis of Variance by Ranks 


Test 23. The van der Waerden Normal-Scores Test for k Independent 
Samples 


I. Hypothesis Evaluated with Test and Relevant Background Information 
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results 
VI. Additional Analytical Procedures for the van der Waerden Normal-Scores 
Test for k Independent Samples 
1. Pairwise comparisons following computation of the test 
statistic for the van der Waerden normal-scores test for k 
independent samples 
VII. Additional Discussion of the van der Waerden Normal-Scores Test for 
k Independent Samples 
1. Alternative normal-scores tests 
VIII. Additional Examples Illustrating the Use of the van der Waerden 
Normal-Scores Test for k Independent Samples 


Inferential Statistical Tests Employed with Two or More 
Dependent Samples (and Related Measures of 
Association/Correlation) 


Test 24. The Single-Factor Within-Subjects Analysis of Variance 


I. Hypothesis Evaluated with Test and Relevant Background Information 
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results 
VI. Additional Analytical Procedures for the Single-Factor Within-Subjects 
Analysis of Variance and/or Related Tests 

1. Comparisons following computation of the omnibus F value 
for the single-factor within-subjects analysis of variance 
(Test 24a: Multiple t tests/Fisher's LSD test; Test 24b: 
The Bonferroni-Dunn test; Test 24c: Tukey's HSD 
test; Test 24d: The Newman-Keuls test; Test 24e: The 
Scheffé test; Test 24f: The Dunnett test; The computation 
of a confidence interval for a comparison; Alternative 
methodology for computing MS... for a comparison) 

2. Comparing the means of three or more conditions when k > 4 

3. Evaluation of the sphericity assumption underlying the 
single-factor within-subjects analysis of variance 

4. Computation of the power of the single-factor within-subjects 
analysis of variance 

5. Measures of magnitude of treatment effect for the single-factor 
within-subjects analysis of variance: Omega squared 
(Test 24g) and Cohen's f index (Test 24h) 

6. Computation of a confidence interval for the mean of a 
treatment population 

VII. Additional Discussion of the Single-Factor Within-Subjects Analysis of 
Variance 

1. Theoretical rationale underlying the single-factor within- 
subjects analysis of variance 

2. Definitional equations for the single-factor within-subjects 
analysis of variance 

3. Relative power of the single-factor within-subjects analysis 
of variance and the single-factor between-subjects analysis of 
variance 

4. Equivalency of the single-factor within-subjects analysis 
of variance and the t test for two dependent samples when 
k = 2 

5. The Latin Square design 

VIII. Additional Examples Illustrating the Use of the Single-Factor Within- 
Subjects Analysis of Variance 


Test 25. The Friedman Two-Way Analysis of Variance by Ranks 


I. Hypothesis Evaluated with Test and Relevant Background Information 
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results 
VI. Additional Analytical Procedures for the Friedman Two-Way Analysis of 
Variance by Ranks and/or Related Tests 
1. Tie correction for the Friedman two-way analysis of variance by 
ranks 
2. Pairwise comparisons following computation of the test 
statistic for the Friedman two-way analysis of variance by 
ranks 
VII. Additional Discussion of the Friedman Two-Way Analysis of Variance by 
Ranks 
1. Exact tables of the Friedman distribution 
2. Equivalency of the Friedman two-way analysis of variance by 
ranks and the binomial sign test for two dependent samples 
when k = 2 
3. Power-efficiency of the Friedman two-way analysis of variance 
by ranks 
4. Alternative nonparametric rank-order procedures for 
evaluating a design involving k dependent samples 
5. Relationship between the Friedman two-way analysis of 
variance by ranks and Kendall’s coefficient of concordance 
VIII. Additional Examples Illustrating the Use of the Friedman Two-Way 
Analysis of Variance by Ranks 


Test 26. The Cochran Q Test 


I. Hypothesis Evaluated with Test and Relevant Background Information 
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results 
VI. Additional Analytical Procedures for the Cochran Q Test and/or 
Related Tests 
1. Pairwise comparisons following computation of the test 
statistic for the Cochran Q test 
VII. Additional Discussion of the Cochran Q Test 
1. Issues relating to subjects who obtain the same score under 
all of the experimental conditions 
2. Equivalency of the Cochran Q test and the McNemar test 
when k = 2 
3. Alternative nonparametric procedures for categorical data for 
evaluating a design involving k dependent samples 
VIII. Additional Examples Illustrating the Use of the Cochran Q Test 


Inferential Statistical Test Employed with Factorial Design 
(and Related Measures of Association/Correlation) 


Test 27. The Between-Subjects Factorial Analysis of Variance 


I. Hypothesis Evaluated with Test and Relevant Background Information 
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results 
VI. Additional Analytical Procedures for the Between-Subjects Factorial 
Analysis of Variance and/or Related Tests 

1. Comparisons following computation of the F values for 
the between-subjects factorial analysis of variance (Test 
27a: Multiple t tests/Fisher's LSD test; Test 27b: The 
Bonferroni-Dunn test; Test 27c: Tukey's HSD test; Test 
27d: The Newman-Keuls test; Test 27e: The Scheffé 
test; Test 27f: The Dunnett test; Comparisons between the 
marginal means; Evaluation of an omnibus hypothesis 
involving more than two marginal means; Comparisons 
between specific groups that are a combination of both 
factors; The computation of a confidence interval for a 
comparison; Analysis of simple effects) 

2. Evaluation of the homogeneity of variance assumption of the 
between-subjects factorial analysis of variance 

3. Computation of the power of the between-subjects factorial 
analysis of variance 


© 2000 by Chapman & Hall/CRC 


4. Measures of magnitude of treatment effect for the between-subjects 
factorial analysis of variance: Omega squared (Test 27g) and 
Cohen's f index (Test 27h) 

5. Computation of a confidence interval for the mean of a 
population represented by a group 

6. Additional analysis of variance procedures for factorial designs 

VII. Additional Discussion of the Between-Subjects Factorial Analysis of 
Variance 

1. Theoretical rationale underlying the between-subjects factorial 
analysis of variance 

2. Definitional equations for the between-subjects factorial 
analysis of variance 

3. Unequal sample sizes 

4. Final comments on the between-subjects factorial analysis of 
variance (Fixed-effects versus random-effects versus mixed- 
effects models; Nested factors/hierarchical designs and designs 
involving more than two factors) 

VIII. Additional Examples Illustrating the Use of the Between-Subjects 
Factorial Analysis of Variance 
IX. Addendum 

1. Discussion of and computational procedures for additional 
analysis of variance procedures for factorial designs: Test 27i: 
The factorial analysis of variance for a mixed design; Test 
27j: The within-subjects factorial analysis of variance 


Measures of Association/Correlation 


Test 28. The Pearson Product-Moment Correlation Coefficient 
I. Hypothesis Evaluated with Test and Relevant Background Information 
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results (Test 28a: Test of significance for 
a Pearson product-moment correlation coefficient; The coefficient 
of determination) 
VI. Additional Analytical Procedures for the Pearson Product-Moment 
Correlation Coefficient and/or Related Tests 
1. Derivation of a regression line 
2. The standard error of estimate 
3. Computation of a confidence interval for the value of the 
criterion variable 
4. Computation of a confidence interval for a Pearson product- 
moment correlation coefficient 
5. Test 28b: Test for evaluating the hypothesis that the true 
population correlation is a specific value other than zero 
6. Computation of power for the Pearson product-moment 
correlation coefficient 
7. Test 28c: Test for evaluating a hypothesis on whether 
there is a significant difference between two independent 
correlations 
8. Test 28d: Test for evaluating a hypothesis on whether k 
independent correlations are homogeneous 
9. Test 28e: Test for evaluating the null hypothesis 
$H_0: \rho_{XZ} = \rho_{YZ}$ 
10. Tests for evaluating a hypothesis regarding one or more 
regression coefficients (Test 28f: Test for evaluating the null 
hypothesis $H_0: \beta = 0$; Test 28g: Test for evaluating the null 
hypothesis $H_0: \beta_1 = \beta_2$) 
11. Additional correlational procedures 
VII. Additional Discussion of the Pearson Product-Moment Correlation 
Coefficient 
1. The definitional equation for the Pearson product-moment 
correlation coefficient 
2. Residuals 
3. Covariance 
4. The homoscedasticity assumption of the Pearson product-moment 
correlation coefficient 
5. The phi coefficient as a special case of the Pearson product-moment 
correlation coefficient 
6. Autocorrelation/serial correlation 


VIII. Additional Examples Illustrating the Use of the Pearson Product- 
Moment Correlation Coefficient 
IX. Addendum 


1. Bivariate measures of correlation that are related to the Pearson 
product-moment correlation coefficient (Test 28h: The point-biserial 
correlation coefficient (and Test 28h-a: Test of 
significance for a point-biserial correlation coefficient); 
Test 28i: The biserial correlation coefficient (and Test 28i-a: 
Test of significance for a biserial correlation coefficient); 
Test 28j: The tetrachoric correlation coefficient (and Test 
28j-a: Test of significance for a tetrachoric correlation 
coefficient)) 
2. Multiple regression analysis (General introduction to multiple 
regression analysis; Computational procedures for multiple 
regression analysis involving three variables: Test 28k: 
The multiple correlation coefficient; The coefficient of 
multiple determination; Test 28k-a: Test of significance for a 
multiple correlation coefficient; The multiple regression 
equation; The standard error of multiple estimate; Computation 
of a confidence interval for Y'; Evaluation of the relative 
importance of the predictor variables; Evaluating the 
significance of a regression coefficient; Computation of a 
confidence interval for a regression coefficient; Partial 
and semipartial correlation (Test 28l: The partial correlation 
coefficient and Test 28l-a: Test of significance for a partial 
correlation coefficient; Test 28m: The semipartial 
correlation coefficient and Test 28m-a: Test of significance 
for a semipartial correlation coefficient); Final comments on 
multiple regression analysis) 
3. Additional multivariate procedures involving correlational 
analysis (Factor analysis; Canonical correlation; Discriminant 
analysis and logistic regression) 
4. Meta-analysis and related topics (Measures of effect size; 
Meta-analytic procedures (Test 28n: Procedure for comparing k 
studies with respect to significance level; Test 28o: The 
Stouffer procedure for obtaining a combined significance 
level (p value) for k studies; The file drawer problem; Test 
28p: Procedure for comparing k studies with respect to 
effect size; Test 28q: Procedure for obtaining a combined 
effect size for k studies); Practical implications of magnitude of 
effect size value; The significance test controversy; The 
minimum-effect hypothesis testing model) 


Test 29. Spearman's Rank-Order Correlation Coefficient 


I. Hypothesis Evaluated with Test and Relevant Background Information 
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results (Test 29a: Test of significance for 
Spearman's rank-order correlation coefficient) 
VI. Additional Analytical Procedures for Spearman's Rank-Order 
Correlation Coefficient and/or Related Tests 
1. Tie correction for Spearman’s rank-order correlation 
coefficient 
2. Spearman’s rank-order correlation coefficient as a special case 
of the Pearson product-moment correlation coefficient 
3. Regression analysis and Spearman’s rank-order correlation 
coefficient 
4. Partial rank correlation 
5. Use of Fisher’s $z_r$ transformation with Spearman’s rank-order 
correlation coefficient 
VII. Additional Discussion of Spearman’s Rank-Order Correlation 
Coefficient 
1. The relationship between Spearman’s rank-order correlation 
coefficient, Kendall’s coefficient of concordance, and the 
Friedman two-way analysis of variance by ranks 
2. Power efficiency of Spearman’s rank-order correlation 
coefficient 
3. Brief discussion of Kendall’s tau: An alternative measure of 
association for two sets of ranks 
4. Weighted rank/top-down correlation 
VIII. Additional Examples Illustrating the Use of Spearman’s Rank-Order 
Correlation Coefficient 


Test 30. Kendall's Tau 


I. Hypothesis Evaluated with Test and Relevant Background Information 
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results (Test 30a: Test of significance for 
Kendall’s tau) 
VI. Additional Analytical Procedures for Kendall’s Tau and/or Related Tests 
1. Tie correction for Kendall’s tau 
2. Regression analysis and Kendall’s tau 
3. Partial rank correlation 
4. Sources for computing a confidence interval for Kendall’s tau 
VII. Additional Discussion of Kendall’s Tau 
1. Power efficiency of Kendall’s tau 
2. Kendall’s coefficient of agreement 
VIII. Additional Examples Illustrating the Use of Kendall's Tau 


Test 31. Kendall’s Coefficient of Concordance 


I. Hypothesis Evaluated with Test and Relevant Background Information 
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results (Test 31a: Test of significance for 
Kendall’s coefficient of concordance) 
VI. Additional Analytical Procedures for Kendall’s Coefficient of 
Concordance and/or Related Tests 
1. Tie correction for Kendall’s coefficient of concordance 
VII. Additional Discussion of Kendall’s Coefficient of Concordance 
1. Relationship between Kendall’s coefficient of concordance and 
Spearman’s rank-order correlation coefficient 
2. Relationship between Kendall’s coefficient of concordance and 
the Friedman two-way analysis of variance by ranks 
3. Weighted rank/top-down concordance 
VIII. Additional Examples Illustrating the Use of Kendall’s Coefficient of 
Concordance 


Test 32. Goodman and Kruskal’s Gamma 


I. Hypothesis Evaluated with Test and Relevant Background Information 
II. Example 
III. Null versus Alternative Hypotheses 
IV. Test Computations 
V. Interpretation of the Test Results (Test 32a: Test of significance for 
Goodman and Kruskal's gamma) 
VI. Additional Analytical Procedures for Goodman and Kruskal's Gamma 
and/or Related Tests 
1. The computation of a confidence interval for the value of 
Goodman and Kruskal's gamma 
2. Test 32b: Test for evaluating the null hypothesis $H_0: \gamma_1 = \gamma_2$ 

3. Sources for computing a partial correlation coefficient for 
Goodman and Kruskal's gamma 

VII. Additional Discussion of Goodman and Kruskal's Gamma 

1. Relationship between Goodman and Kruskal's gamma and 


Introduction 


The intent of this Introduction is to provide a general overview of the basic terminology 
employed within the areas of descriptive statistics and experimental design, although it is 
assumed that the reader has prior familiarity with these topics. It will also review basic concepts that are 
required for both understanding and using the statistical procedures that are described in this 
book. Following the Introduction is an outline of all the procedures that are covered, as well 
as decision-making charts to aid the reader in selecting the appropriate statistical procedure. 


Descriptive versus Inferential Statistics 


Statistics is a field within mathematics that involves the summary and analysis of data. The field 
of statistics can be divided into two general areas, descriptive statistics and inferential 
statistics. Descriptive statistics is a branch of statistics in which data are only used for 
descriptive purposes and are not employed to make predictions. Thus, descriptive statistics 
consists of methods and procedures for presenting and summarizing data. The procedures most 
commonly employed in descriptive statistics are the use of tables and graphs, and the 
computation of measures of central tendency and variability. Measures of association or 
correlation, which are covered in this book, are also categorized by some sources as descriptive 
statistical procedures, insofar as they serve to describe the relationship between two or more 
variables. 

Inferential statistics employs data in order to draw inferences (i.e., derive conclusions) or 
make predictions. Typically, in inferential statistics sample data are employed to draw inferences 
about one or more populations from which the samples have been derived. Whereas a 
population consists of the sum total of subjects or objects that share something in common with 
one another, a sample is a set of subjects or objects which have been derived from a population. 
For a sample to be useful in drawing inferences about the larger population from which it was 
drawn, it must be representative of the population. Thus, typically (although there are 
exceptions) the ideal sample to employ in research is a random sample. In a random sample, 
each subject or object in the population has an equal likelihood of being selected as a member 
of that sample. In point of fact, it would be highly unusual to find an experiment that employed 
a truly random sample. Pragmatic and/or ethical factors make it literally impossible in most 
instances to obtain random samples for research. To the extent that a sample is not random, the 
degree to which a researcher is able to generalize his or her results is limited. Put simply, one can only 
generalize to objects or subjects that are similar to the sample employed. 
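
The definition of a random sample given above can be illustrated with a minimal Python sketch. The population of 100 numbered subjects, the sample size of 10, and the function name `draw_random_sample` are all hypothetical choices made for illustration; the sketch assumes sampling without replacement, in which every member of the population has the same chance of being selected.

```python
import random


def draw_random_sample(population, n, seed=None):
    """Draw a simple random sample of n members without replacement.

    Every member of the population has an equal likelihood of selection.
    The seed is optional and only makes the draw reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(population, n)


# Hypothetical population: 100 subjects identified by the numbers 1 to 100.
population = list(range(1, 101))
sample = draw_random_sample(population, 10, seed=1)
print(sample)
```

In practice, as the text notes, a researcher seldom has access to a complete list of the population, which is why a truly random sample of this kind is rarely obtained.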


Statistic versus Parameter 


A statistic refers to a characteristic of a sample, such as the average score (also known as the 
mean). A parameter, on the other hand, refers to a characteristic of a population (such as the 
average of a whole population). In inferential statistics the computed value of a statistic (e.g., 
a sample mean) is employed to make inferences about a parameter in the population from which 
the sample was derived (e.g., the population mean). The statistical procedures described in this 
book all employ data derived from one or more samples, in order to draw inferences or make 
predictions with respect to the larger population(s) from which the sample(s) were drawn. 

A statistic can be employed for either descriptive or inferential purposes. An example of 
using a statistic for descriptive purposes is obtaining the mean of a group (which represents a 
sample) in order to summarize the average performance of the group. On the other hand, if we 
use the mean of a group to estimate the mean of a larger population the group is supposed to 
represent, the statistic (i.e., the group mean) is being employed for inferential purposes. The 
most basic statistics that are employed for both descriptive and inferential purposes are measures 
of central tendency (of which the mean is an example) and measures of variability. 

When data from a sample are employed to estimate a population parameter, any statistic 
derived from the sample should be unbiased. An unbiased statistic is one whose value, averaged 
over all possible samples, equals the population parameter it estimates, and which in that sense 
provides the most accurate estimate of the parameter. A biased statistic, on the other hand, 
systematically overestimates or underestimates that parameter. The subject of bias in statistics will be discussed later in 
reference to the mean (which is the most commonly employed measure of central tendency), and 
the variance (which is the most commonly employed measure of variability). 
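
The distinction between a biased and an unbiased statistic can be sketched with the variance. The following Python example is only an illustration under stated assumptions: the four-score population is hypothetical, samples of size 2 are drawn with replacement, and every possible sample is enumerated so that the averages are exact. Dividing the sum of squared deviations by n yields an estimate that is on average too small, whereas dividing by n - 1 yields an estimate that on average equals the population variance.

```python
from fractions import Fraction
from itertools import product


def average(values):
    """Exact arithmetic mean, returned as a Fraction."""
    return sum(values, Fraction(0)) / len(values)


def sum_of_squared_deviations(scores):
    """Sum of squared deviations of the scores about their own mean."""
    m = average(scores)
    return sum((x - m) ** 2 for x in scores)


population = [2, 4, 6, 8]  # hypothetical population of four scores
n = 2                      # hypothetical sample size

# The parameter: the population variance (divisor is the population size).
population_variance = sum_of_squared_deviations(population) / len(population)

# Every equally likely sample of size n drawn with replacement.
samples = list(product(population, repeat=n))

# Average value of each estimator over all possible samples.
biased_average = average([sum_of_squared_deviations(s) / n for s in samples])
unbiased_average = average([sum_of_squared_deviations(s) / (n - 1) for s in samples])

print(population_variance, biased_average, unbiased_average)
```

Exact fractions are used rather than floating-point numbers so that the equality between the average of the n - 1 estimator and the population variance can be checked without rounding error.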


Levels of Measurement 


Typically, information that is quantified in research for purposes of analysis is categorized with 
respect to the level of measurement the data represent. Different levels of measurement contain 
different amounts of information with respect to whatever the data are measuring. Statisticians 
generally conceptualize data as fitting within one of the following four measurement categories: 
nominal data (also known as categorical data), ordinal data (also known as rank-order data), 
interval data, and ratio data. As one moves from the lowest level of measurement, nominal 
data, to the highest level, ratio data, the amount of information provided by the numbers 
increases, as does the number of meaningful mathematical operations that can be performed on those 
numbers. Each of the levels of measurement will now be discussed in more detail. 

a) Nominal/categorical level measurement In nominal/categorical measurement, 
numbers are employed merely to identify mutually exclusive categories, but cannot be 
manipulated in a meaningful mathematical manner. As an example, a person's social security 
number represents nominal measurement, since it is used purely for purposes of identification, 
and cannot be meaningfully manipulated in a mathematical sense (i.e., adding, subtracting, etc. 
the social security numbers of people does not yield anything of tangible value). 

b) Ordinal/rank-order level measurement In an ordinal scale, the numbers represent 
rank-orders, and do not give any information regarding the differences between adjacent ranks. 
Thus, the order of finish in a horse race represents an ordinal scale. If in a race Horse A beats 
Horse B in a photo finish, and Horse B beats Horse C by twenty lengths, the respective order of 
finish of the three horses reveals nothing about the fact that the distance between the first and 
second place horses was minimal, while the difference between second and third place horses 
was substantial. 
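
The horse race example can be sketched in Python to show what is lost when measurements are reduced to ranks. The finishing times below are hypothetical values chosen to mimic a photo finish followed by a large gap, and the function name `ranks_from_times` is introduced only for this illustration.

```python
def ranks_from_times(times):
    """Convert finishing times to ordinal ranks (1 = fastest)."""
    ordered = sorted(times, key=times.get)
    return {horse: place for place, horse in enumerate(ordered, start=1)}


# Hypothetical finishing times in seconds.
finish_times = {"Horse A": 120.00, "Horse B": 120.01, "Horse C": 123.50}
ranks = ranks_from_times(finish_times)

# Differences between adjacent finishers, in seconds and in ranks.
time_gaps = [
    round(finish_times["Horse B"] - finish_times["Horse A"], 2),
    round(finish_times["Horse C"] - finish_times["Horse B"], 2),
]
rank_gaps = [
    ranks["Horse B"] - ranks["Horse A"],
    ranks["Horse C"] - ranks["Horse B"],
]

print(ranks)
print(time_gaps, rank_gaps)
```

The two rank differences are identical even though the underlying time differences are not, which is the sense in which an ordinal scale gives no information about the differences between adjacent ranks.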

c) Interval level measurement An interval scale not only considers the relative order of 
the measures involved (as is the case with an ordinal scale) but, in addition, is characterized by 
the fact that throughout the length of the scale, equal differences between measurements cor- 
respond to equal differences in the amount of the attribute being measured. What this translates 
to is that if IQ is conceptualized as an interval scale, the one point difference between a person 
who has an IQ of 100 and someone who has an IQ of 101 should be equivalent to the one point 
difference between a person who has an IQ of 140 and someone with an IQ of 141. In actuality 
some psychologists might argue this point, suggesting that a greater increase in intelligence is 
required to jump from an IQ of 140 to 141, than to jump from an IQ of 100 to 101. In fact, if the 
latter is true, a one point difference does not reflect the same magnitude of difference across the 
full range of the IQ scale. Although in practice IQ and most other human characteristics measured 
by psychological tests (such as anxiety, introversion-extroversion, etc.) are treated as interval 
scales, many researchers would argue that they are more appropriately categorized as ordinal 
scales. Such an argument would be based on the fact that such measures do not really meet the 
requirements of an interval scale, because it cannot be demonstrated that equal numerical dif- 
ferences at different points of the scale are comparable. 

It should also be noted that, unlike ratio scales which will be discussed next, interval scales 
do not have a true zero point. If interval scales have a zero score that can be assigned to a person 
or object, it is assumed to be arbitrary. Thus, in the case of IQ we can ask the question of 
whether or not there is truly an IQ which is so low that it literally represents zero IQ. In reality, 
you probably can only say a person who is dead has a zero IQ! In point of fact, someone who 
has obtained an IQ of zero on an IQ test has been assigned that score because his performance 
on the test was extremely poor. The zero IQ designation does not necessarily mean the person 
could not answer any of the test questions (or, to go further, that the individual possesses none 
of the requisite skills or knowledge for intelligence). The developers of the test just decided to 
select a certain minimum score on the test and designate it as the zero IQ point. 

d) Ratio level measurement As is the case with interval level measurement, ratio level 
measurement is also characterized by the fact that throughout the length of the scale, equal dif- 
ferences between measurements correspond to equal differences in the amount of the attribute 
being measured. However, ratio level measurement is also characterized by the fact that it has 
a true zero point. Because of the latter, with ratio measurement one is able to make meaningful 
ratio statements with regard to the attribute/variable being measured. To illustrate these points, 
most physical measures such as weight, height, blood glucose level, as well as measures of 
certain behaviors such as the number of times a person coughs or the number of times a child 
cries, represent ratio scales. For all of the aforementioned measures there is a true zero point 
(i.e., zero weight, zero height, zero blood glucose, zero coughs, zero episodes of crying), and for 
each of these measures one is able to make meaningful ratio statements (such as Ann weighs 
twice as much as Joan, Bill is one-half the height of Steve, Phil's blood glucose is 100 times 
Sam's, Mary coughs five times as often as Pete, and Billy cries three times as much as Heather). 
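
Why a true zero point is needed for ratio statements can be sketched numerically. In the Python example below, the weights assigned to Ann and Joan, the approximate conversion factor, and the 40-unit shift are all hypothetical. Changing the unit of measurement leaves the ratio intact, whereas moving the zero point to an arbitrary location (as is assumed for an interval scale) changes it.

```python
import math

KG_TO_LB = 2.20462  # approximate conversion factor


def ratio(a, b):
    """Ratio of two measurements taken on the same scale."""
    return a / b


# Hypothetical weights in kilograms.
ann_kg, joan_kg = 60.0, 30.0

ratio_kg = ratio(ann_kg, joan_kg)
# A change of unit keeps the true zero point, so the ratio is preserved.
ratio_lb = ratio(ann_kg * KG_TO_LB, joan_kg * KG_TO_LB)
# Shifting every score by 40 units stands in for an arbitrary zero point.
ratio_shifted = ratio(ann_kg + 40.0, joan_kg + 40.0)

print(ratio_kg, ratio_lb, ratio_shifted)
print(math.isclose(ratio_kg, ratio_lb), math.isclose(ratio_kg, ratio_shifted))
```

Under these assumptions the statement "Ann weighs twice as much as Joan" holds in either unit of weight, but it no longer holds once the zero point has been moved.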


Continuous versus Discrete Variables 


When measures are obtained on people or objects, in most instances we assume that there will 
be variability. Since we assume variability, if we are quantifying whatever it is that is being 
measured, not everyone or everything will produce the same score. For this reason, when 
something is measured it is commonly referred to as a variable. As noted above, variables can 
be categorized with respect to the level of measurement they represent. In addition, a variable 
can be categorized with respect to whether it is continuous or discrete. A continuous variable 
can assume any value within the range of scores that define the limits of that variable. A discrete 
variable, on the other hand, can assume only a limited number of values. To illustrate, tem- 
perature (which can assume both integer and fractional/decimal values within a given range) is 
a continuous variable. Theoretically there are an infinite number of possible temperature 
values, and the number of temperature values we can measure is limited only by the precision 
of the instrument we are employing to obtain the measurements. The face value of a die, on the 
other hand, is a discrete variable, since it can only assume the integer values 1 through 6. 


Measures of Central Tendency 


Earlier in the Introduction it was noted that the most commonly employed statistics are measures 
of central tendency and measures of variability. This section will describe three measures of 
central tendency: the mode, the median, and the mean. 

The mode The mode is the most frequently occurring score in a distribution of scores. 
A mode that is derived for a sample is a statistic, whereas the mode of a population is a 
parameter. In the following distribution of scores the mode is 5, since it occurs two times, 
whereas all other scores occur only once: 0, 1, 2, 5, 5, 8, 10. If more than one score occurs with 
the highest frequency, it is possible to have two or more modes in a distribution. Thus, in the 
distribution 0, 1, 2, 5, 6, 8, 10, all of the scores represent the mode, since each score occurs one 
time. A distribution with more than one mode is referred to as a multimodal distribution. If 
it happens that two scores both occur with the highest frequency, the distribution would be 
described as a bimodal distribution, which represents one type of multimodal distribution. The 
distribution 0, 5, 5, 8, 9, 9, 12 is bimodal, since the scores 5 and 9 both occur two times, and all 
other scores appear once. 
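
The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `modes` is an assumption introduced here, not notation from the Handbook. It returns every score that occurs with the highest frequency, so a unimodal distribution yields one value and a bimodal distribution yields two.

```python
from collections import Counter


def modes(scores):
    """Return every score that occurs with the highest frequency, in sorted order."""
    counts = Counter(scores)            # score -> number of occurrences
    highest = max(counts.values())      # the highest frequency in the distribution
    return sorted(score for score, freq in counts.items() if freq == highest)


print(modes([0, 1, 2, 5, 5, 8, 10]))    # a single mode
print(modes([0, 5, 5, 8, 9, 9, 12]))    # a bimodal distribution
print(modes([0, 1, 2, 5, 6, 8, 10]))    # every score occurs once, so all are modes
```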

The most common situation in which the mode is employed as a descriptive measure is 
when a large body of data is presented in a tabular format listing the frequency of each score. 
Such a table is referred to as a frequency distribution. An example of a frequency distribution 
would be if the scores 0, 5, 5, 8, 9, 9, 12 were arranged in a tabular format. Specifically, each 
score in the range 0 to 12 would be recorded in one column, and an adjacent column would list 
the frequency of occurrence for each score. Another example of a frequency distribution is provided by the first 
two columns of Table 7.1 in Section I of the Kolmogorov-Smirnov goodness-of-fit test for a 
single sample (Test 7). The latter table also contains a cumulative frequency distribution 
(which is represented by the third column of the table). In a cumulative frequency distribution, 
the frequency recorded for each score represents the frequency of a score plus the frequencies 
of all scores which are less than that score. Scores are arranged ordinally, with the lowest score 
at the bottom of the distribution, and the highest score at the top of the distribution. The 
cumulative frequency for the lowest score will simply be the frequency for that score, since there 
are no scores below it. On the other hand, the cumulative frequency for the highest score will 
always equal n, the total number of scores in the distribution. 
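
As a rough sketch of how such a table might be built, the following Python fragment tabulates the scores 0, 5, 5, 8, 9, 9, 12. The function name `frequency_table` is hypothetical, and the code assumes integer scores so that every value in the range can be listed.

```python
from collections import Counter


def frequency_table(scores):
    """Return rows of (score, frequency, cumulative frequency), lowest score first."""
    counts = Counter(scores)
    rows = []
    running_total = 0
    for score in range(min(scores), max(scores) + 1):
        running_total += counts[score]   # a Counter returns 0 for scores that never occur
        rows.append((score, counts[score], running_total))
    return rows


table = frequency_table([0, 5, 5, 8, 9, 9, 12])

# The text places the highest score at the top, so the rows are printed in reverse.
print("score  freq  cum freq")
for score, freq, cum_freq in reversed(table):
    print(f"{score:>5} {freq:>5} {cum_freq:>9}")
```

The cumulative frequency in the row for the highest score equals the total number of scores, and the row for the lowest score simply repeats that score's own frequency.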

The median The median is the middle score in a distribution. If there is an odd number of 
scores in a distribution, in order to determine the median the following protocol should be employed: 
Divide the total number of scores by 2 and add .5 to the result of the division. The obtained value 
indicates the ordinal position of the score which represents the median of the distribution (note that 
this value does not represent the median). Thus, if we have a distribution consisting of five scores 
(e.g., 6, 8, 9, 13, 16), we divide the number of scores in the distribution by two, and add .5 to the 
result of the division. Thus, (5/2) + .5 = 3. The obtained value of 3 indicates that if the five scores 
are arranged ordinally (i.e., from lowest to highest), the median is the 3rd highest (or 3rd lowest) 
score in the distribution. With respect to the distribution 6, 8, 9, 13, 16, the value of the median will 
equal 9, since 9 is the score in the third ordinal position. 

If there are an even number of scores in a distribution, there will be two middle scores. The 
median is the average of the two middle scores. To determine the ordinal positions of the two 
middle scores, divide the total number of scores in the distribution by 2. The number value 
obtained by that division and the number value that is one above it represent the ordinal positions 
of the two middle scores. To illustrate, assume we have a distribution consisting of the following 
six scores: 6, 8, 9, 12, 13, 16. To determine the median, we initially divide 6 by 2 which equals 
3. Thus, if we arrange the scores ordinally, the 3rd and 4th (since 3 + 1 = 4) scores are the 
middle scores. The average of these scores, which are, respectively, 9 and 12, is the median 
(which will be represented by the notation M). Thus, M = (9 + 12)/2 = 10.5. Note once again 
that in this example, as was the case in the previous one, the initial values computed (3 and 4) 
do not themselves represent the median, but instead represent the ordinal position of the scores 
used to compute the median. As was the case with the mode, a median value derived for a 
sample is a statistic, whereas the median of a whole population is a parameter. 
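
The two protocols just described (one for an odd number of scores, one for an even number) can be sketched as follows. The function name `median` is chosen here for illustration. Ordinal positions in the text are counted from 1, whereas Python list indices start at 0, hence the subtraction of 1.

```python
def median(scores):
    """Median via the ordinal-position protocol described in the text."""
    ordered = sorted(scores)             # arrange the scores from lowest to highest
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2          # same value as (n / 2) + .5 when n is odd
        return ordered[position - 1]     # convert the 1-based position to a list index
    lower = n // 2                       # ordinal position of the lower middle score
    upper = lower + 1                    # ordinal position of the upper middle score
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


print(median([6, 8, 9, 13, 16]))         # odd number of scores
print(median([6, 8, 9, 12, 13, 16]))     # even number of scores
```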
The mean The mean, which is the most commonly employed measure of central tendency, 
is the average score in a distribution. Within the framework of the discussion to follow, the nota- 
tion n will represent the number of subjects or objects in a sample, and the notation N will represent 
the total number of subjects or objects in the population from which the sample is derived. 

Equation I.1 is employed to compute the mean of a sample. $\Sigma$, which is the upper case 
Greek letter sigma, is a summation sign. The notation $\Sigma X$ indicates that the set of scores should 
be summed. 

$$\bar{X} = \frac{\Sigma X}{n} \qquad \text{(Equation I.1)}$$


Sometimes Equation I.1 is written in the following more complex but equivalent form 
containing subscripts: $\bar{X} = \sum_{i=1}^{n} X_i / n$. In the latter equation, the notation $\sum_{i=1}^{n} X_i$ indicates that 
beginning with the first score, scores 1 through n (i.e., all the scores) are to be summed. $X_i$ 
represents the score of the $i^{th}$ subject or object. 

Equation I.1 will now be applied to the following distribution of five scores: 6, 8, 9, 13, 
16. Since n = 5 and $\Sigma X$ = 52, $\bar{X} = \Sigma X / n = 52/5 = 10.4$. 
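
The same computation can be written as a short Python sketch. The function name `sample_mean` is an assumption made for this illustration.

```python
def sample_mean(scores):
    """Sum the scores and divide by the number of scores (Equation I.1)."""
    return sum(scores) / len(scores)


scores = [6, 8, 9, 13, 16]
print(sum(scores), len(scores))          # the sum of the scores and n
print(round(sample_mean(scores), 2))     # the sample mean
```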

Whereas Equation I.1 describes how one can compute the mean of a sample, Equation I.2 
describes how one can compute the mean of a population. The simplified version without sub- 
scripts is to the right of the first = sign, and the subscripted version of the equation is to the right 
of the second = sign. The mean of a population is represented by the notation $\mu$, which is the 
lower case Greek letter mu. In practice, it would be highly unusual to have occasion to compute 
the mean of a population. Indeed, a great deal of analysis in inferential statistics is concerned 
with trying to estimate the mean of a population from the mean of a sample. 


$$\mu = \frac{\Sigma X}{N} = \frac{\sum_{i=1}^{N} X_i}{N} \qquad \text{(Equation I.2)}$$





Note that in the numerator of Equation I.2 all N scores in the population are summed, as 
opposed to just summing n scores when the value of $\bar{X}$ is computed. The sample mean $\bar{X}$ is an 
unbiased estimate of the population mean $\mu$, which indicates that if one has a distribution of n 
scores, $\bar{X}$ provides the best possible estimate of the true value of $\mu$. Typically, when the mean 
is used as a measure of central tendency, it is employed with interval or ratio level data. 


Measures of Variability 


In this section a number of measures of variability will be discussed. Primary emphasis, 
however, will be given to the standard deviation and the variance, which are the most 
commonly employed measures of variability. 

a) The range The range is the difference between the highest and lowest score in a 
distribution. Thus in the distribution 2, 3, 5, 6, 7, 12, the range is the difference between 12 (the 
highest score) and 2 (the lowest score). Thus: Range = 12 - 2 = 10. Some sources add one to 
the obtained value, and would thus say that the Range = 11. Although the range is employed 
on occasion for descriptive purposes, it is of little use in inferential statistics. 

b) Quantiles, percentiles, quartiles, and deciles A quantile is a measure that divides a 
distribution into equidistant percentage points. Examples of quantiles are percentiles, quartiles, 
and deciles. Percentiles divide a distribution into blocks comprised of one percentage point (or 
blocks that comprise a proportion equal to .01 of the distribution). A specific percentile value 
corresponds to the point in a distribution at which a given percentage of scores falls at or below. 
Thus, if an IQ test score of 115 falls at the 84th percentile, it means 84% of the population has 
an IQ of 115 or less. The term percentile rank is also employed to mean the same thing as a 
percentile; in other words, we can say that an IQ score of 115 has a percentile rank of 84%. 
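
The "at or below" definition of a percentile rank can be sketched as follows. The function name `percentile_rank` and the data set (the integers 1 through 100) are hypothetical and serve only to illustrate the counting. Other sources define percentile ranks slightly differently (for example, counting only scores strictly below the value), so this is one convention among several.

```python
def percentile_rank(scores, value):
    """Percentage of scores that fall at or below the given value."""
    at_or_below = sum(1 for score in scores if score <= value)
    return 100 * at_or_below / len(scores)


# Hypothetical data: one hundred scores, one at each integer from 1 through 100.
scores = list(range(1, 101))
print(percentile_rank(scores, 84))       # percentage of scores at or below 84
```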

Deciles divide a distribution into blocks comprised of ten percentage points (or blocks that 
comprise a proportion equal to .10 of the distribution). A distribution can be divided into ten 
deciles, the upper limits of which are defined by the 10th percentile, 20th percentile, ..., 90th 
percentile, and 100th percentile. Thus, a score that corresponds to the 10th percentile falls at the 
upper limit of the first decile of the distribution. A score that corresponds to the 20th percentile 
falls at the upper limit of the second decile of the distribution, and so on. The interdecile range 
is the difference between the scores at the 90th percentile (the upper limit of the ninth decile) and 
the 10th percentile. 

Quartiles divide a distribution into blocks comprised of 25 percentage points (or blocks 
that comprise a proportion equal to .25 of the distribution). A distribution can be divided into 
four quartiles, the upper limits of which are defined by the 25th percentile, 50th percentile 
(which corresponds to the median of the distribution), 75th percentile, and 100th percentile. 
Thus, a score that corresponds to the 25th percentile falls at the upper limit of the first quartile 
of the distribution. A score that corresponds to the 50th percentile falls at the upper limit of the 
second quartile of the distribution, and so on. The interquartile range is the difference between 
the scores at the 75th percentile (which is the upper limit of the third quartile) and the 25th 
percentile. 

Infrequently, the interdecile or interquartile ranges may be employed to represent variability. 
An example of a situation where a researcher might elect to employ either of these measures to 
represent variability would be when the researcher wishes to omit a few extreme scores in a dis- 
tribution. Such extreme scores are referred to as outliers. Specifically, an outlier is a score in a 
set of data which is so extreme that, by all appearances, it is not representative of the population 
from which the sample is ostensibly derived. Since the presence of outliers can dramatically affect 
variability (as well as the value of the sample mean), their presence may lead a researcher to believe 
that the variability of a distribution might best be expressed through use of the interdecile or 
interquartile range (as well as the fact that when outliers are present, the sample median is more 
likely than the mean to be a representative measure of central tendency). Outliers are discussed 
in detail in Section VI of the t test for two independent samples (Test 11). 

c) The variance and the standard deviation The most commonly employed measures 
of variability in both inferential and descriptive statistics are the variance and the standard devi- 
ation. These two measures are directly related to one another, since the standard deviation is the 
square root of the variance (and thus the variance is the square of the standard deviation). As is 
the case with the mean, the standard deviation and the variance are generally only employed with 
interval or ratio level data. 

The formal definition of the variance is that it is the mean of the squared difference scores 
(which are also referred to as deviation scores). This definition implies that in order to compute 
the variance of a distribution one must subtract the mean of the distribution from each score, 
square each of the difference scores, sum the squared difference scores, and divide the latter 
value by the number of scores in the distribution. The logic of this definition is reflected in the 
definitional equations which will be presented later in this section for both the variance and the 
standard deviation. 

A definitional equation for a statistic (or parameter) contains the specific mathematical 
operations that are described in the definition of that statistic (or parameter). On the other hand, 
a computational equation for the same statistic (or parameter) does not clearly reflect the definition 
of that statistic (or parameter). A computational equation, however, facilitates computation 
of the statistic (or parameter), since it is computationally less involved than the definitional 
equation. In this book, in instances where a definitional and computational equation are available 
for computing a test statistic, the computational equation will generally be employed to facilitate 
calculations. 

The following notation will be used in the book with respect to the values of the variance 
and the standard deviation. 

$\sigma^2$ (where $\sigma$ is the lower case Greek letter sigma) will represent the variance of a population. 

$s^2$ will represent the variance of a sample, when the variance is employed for descriptive 
purposes. $s^2$ will be a biased estimate of the population variance $\sigma^2$ and, because of this, $s^2$ 
will generally underestimate the true value of $\sigma^2$. 

$\tilde{s}^2$ will represent the variance of a sample, when the variance is employed for inferential 
purposes. $\tilde{s}^2$ will be an unbiased estimate of the population variance $\sigma^2$. 

$\sigma$ will represent the standard deviation of a population. 

$s$ will represent the standard deviation of a sample, when the standard deviation is employed 
for descriptive purposes. $s$ will be a biased estimate of the population standard deviation $\sigma$ 
and, because of this, $s$ will generally underestimate the true value of $\sigma$. 

$\tilde{s}$ will represent the standard deviation of a sample, when the standard deviation is 
employed for inferential purposes. $\tilde{s}$ will be an unbiased estimate of the population standard 
deviation $\sigma$.¹ 

Equations I.3-I.8 are employed to compute the values $\sigma^2$, $s^2$, $\tilde{s}^2$, $\sigma$, $s$, and $\tilde{s}$. Note that 
in each case, two equivalent methods are presented for computing the statistic or parameter in 
question. The formula to the left is the definitional equation, whereas the formula to the right is 
the computational equation. 


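$$\sigma^2 = \frac{\Sigma (X - \mu)^2}{N} = \frac{\Sigma X^2 - \frac{(\Sigma X)^2}{N}}{N} \qquad \text{(Equation I.3)}$$

$$s^2 = \frac{\Sigma (X - \bar{X})^2}{n} = \frac{\Sigma X^2 - \frac{(\Sigma X)^2}{n}}{n} \qquad \text{(Equation I.4)}$$

$$\tilde{s}^2 = \frac{\Sigma (X - \bar{X})^2}{n - 1} = \frac{\Sigma X^2 - \frac{(\Sigma X)^2}{n}}{n - 1} \qquad \text{(Equation I.5)}$$

$$\sigma = \sqrt{\frac{\Sigma (X - \mu)^2}{N}} = \sqrt{\frac{\Sigma X^2 - \frac{(\Sigma X)^2}{N}}{N}} \qquad \text{(Equation I.6)}$$

$$s = \sqrt{\frac{\Sigma (X - \bar{X})^2}{n}} = \sqrt{\frac{\Sigma X^2 - \frac{(\Sigma X)^2}{n}}{n}} \qquad \text{(Equation I.7)}$$

$$\tilde{s} = \sqrt{\frac{\Sigma (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{\Sigma X^2 - \frac{(\Sigma X)^2}{n}}{n - 1}} \qquad \text{(Equation I.8)}$$
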
When the variance or standard deviation of a sample is computed within the framework of 
an inferential statistical test, one always wants an unbiased estimate of the population variance 
or the population standard deviation. Thus, the computational form of Equation I.5 will be employed 
throughout this book when a sample variance is used to estimate a population variance, 
and the computational form of Equation I.8 will be employed when a sample standard deviation 
is used to estimate a population standard deviation. 

The reader should take note of the fact that some sources employ subscripted versions of 
the above equations. Thus, the computational form of Equation I.5 is often written as: 


$$\tilde{s}^2 = \frac{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n - 1}$$

Although the subscripted version will not be employed for computing the values of $\tilde{s}^2$ and 
$\tilde{s}$, in the case of some equations that are presented, subscripted versions may be employed in 
order to clarify the mathematical operations involved in computing a statistic. 

As noted previously, for the same set of data the value of $\tilde{s}^2$ will always be larger than the 
value of $s^2$ (provided the scores are not all identical, in which case both values equal zero). This 
can be illustrated with a distribution consisting of the five scores: 6, 8, 9, 13, 
16. The following values are substituted in Equations I.4 and I.5: $\Sigma X$ = 52, $\Sigma X^2$ = 606, n = 5. 


$$s^2 = \frac{606 - \frac{(52)^2}{5}}{5} = 13.04$$

$$\tilde{s}^2 = \frac{606 - \frac{(52)^2}{5}}{5 - 1} = 16.3$$


Since the standard deviation is the square root of the variance, we can quickly determine 
that $s = \sqrt{s^2} = \sqrt{13.04} = 3.61$ and $\tilde{s} = \sqrt{\tilde{s}^2} = \sqrt{16.3} = 4.04$. Note that $\tilde{s}^2 > s^2$ and $\tilde{s} > s$. 
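
The calculations above can be checked with a short Python sketch that computes the sum of squared deviations in both the definitional and the computational form, then divides by n for the biased estimate and by n - 1 for the unbiased estimate. The function and variable names are assumptions introduced for this illustration.

```python
import math


def sum_of_squares_definitional(scores):
    """Subtract the mean from each score, square the differences, and sum them."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)


def sum_of_squares_computational(scores):
    """Sum of the squared scores minus (sum of the scores) squared over n."""
    return sum(x * x for x in scores) - sum(scores) ** 2 / len(scores)


scores = [6, 8, 9, 13, 16]
n = len(scores)
ss = sum_of_squares_computational(scores)

biased_variance = ss / n                 # Equation I.4 (descriptive use)
unbiased_variance = ss / (n - 1)         # Equation I.5 (inferential use)
biased_sd = math.sqrt(biased_variance)   # Equation I.7
unbiased_sd = math.sqrt(unbiased_variance)  # Equation I.8

print(round(biased_variance, 2), round(unbiased_variance, 2))
print(round(biased_sd, 2), round(unbiased_sd, 2))
```

Both forms of the numerator give the same value apart from floating-point rounding, which is why the computational form can be substituted freely for the definitional one.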

Table I.1 summarizes the computation of the unbiased estimate of the population variance 
($\tilde{s}^2$), employing both the definitional and computational equations. Note that in the two versions 
for computing $\tilde{s}^2$ listed for Equation I.5, the numerator values $\Sigma X^2 - [(\Sigma X)^2/n]$ and $\Sigma (X - \bar{X})^2$ 
are equivalent. Thus, in Table I.1, the sum of the values of the last column, 
$\Sigma (X - \bar{X})^2 = 65.20$, equals $\Sigma X^2 - [(\Sigma X)^2/n] = 606 - [(52)^2/5] = 65.20$. 

The reader should take note of the following with respect to the standard deviation and the 
variance: 

a) The value of a standard deviation or a variance can never be a negative number. If a 
negative number is ever obtained for either value, it indicates a mistake has been made in the cal- 
culations. The only time the value of a standard deviation or variance will not be a positive 
number is when its value equals zero. The only instance in which the value of both the standard 
deviation and variance of a distribution will equal zero is when all of the scores in the distribution 
are identical to one another. 

b) As the value of the sample size (n) increases, the difference between the values of $s^2$ and 
$\tilde{s}^2$ will decrease. In the same respect, as the value of n increases, the difference between the 
values of $s$ and $\tilde{s}$ will decrease. Thus, the biased estimate of the variance (or standard deviation) 
will be more likely to underestimate the true value of the population variance (or standard 
deviation) with small sample sizes than it will with large sample sizes. 
Table I.1 Computation of Estimated Population Variance 

   $X$      $X^2$     $\bar{X}$      $(X - \bar{X})$           $(X - \bar{X})^2$ 

    6       36      10.4     (6 - 10.4) = -4.4      (-4.4)^2 = 19.36 
    8       64      10.4     (8 - 10.4) = -2.4      (-2.4)^2 =  5.76 
    9       81      10.4     (9 - 10.4) = -1.4      (-1.4)^2 =  1.96 
   13      169      10.4     (13 - 10.4) = 2.6      (2.6)^2  =  6.76 
   16      256      10.4     (16 - 10.4) = 5.6      (5.6)^2  = 31.36 

   $\Sigma X$ = 52    $\Sigma X^2$ = 606    $\Sigma (X - \bar{X})$ = 0    $\Sigma (X - \bar{X})^2$ = 65.20 

$$\tilde{s}^2 = \frac{\Sigma X^2 - \frac{(\Sigma X)^2}{n}}{n - 1} = \frac{606 - \frac{(52)^2}{5}}{5 - 1} = \frac{65.20}{5 - 1} = 16.3$$


c) The numerator of any of the equations employed to compute a variance or a standard 
deviation is often referred to as the sum of squares. Thus in the example in this section, the 
value of the sum of squares is 65.2, since $\Sigma X^2 - [(\Sigma X)^2/n] = 606 - [(52)^2/5] = 65.2$. The 
denominators of both Equation I.5 and Equation I.8 are often referred to as the degrees of 
freedom (a concept that is discussed later in the book within the framework of the single-sample 
t test (Test 2)). Based on what has been said with respect to the sum of squares and the degrees 
of freedom, the variance is sometimes defined as the sum of squares divided by the degrees of 
freedom. 
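
The points made above about sample size and the sum of squares can be tied together in one line. Writing SS for the sum of squares (a shorthand assumed here for illustration), with $s^2$ denoting the biased estimate (denominator n) and $\tilde{s}^2$ the unbiased estimate (denominator n - 1), both estimates share the same numerator and differ only in the denominator:

```latex
\[
\tilde{s}^2 \;=\; \frac{SS}{n-1} \;=\; \frac{n}{n-1}\cdot\frac{SS}{n} \;=\; \frac{n}{n-1}\, s^2
\]
% For the five scores 6, 8, 9, 13, 16:  (5/4)(13.04) = 16.3
```

Since the factor n/(n - 1) approaches 1 as n increases, the two estimates converge for large samples. With n = 5 the factor is 1.25, whereas with n = 100 it is only about 1.01.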

d) The coefficient of variation An alternative, although infrequently employed measure 
of variability, is the coefficient of variation. Since the values of the standard deviation and vari- 
ance are a direct function of the magnitude of the scores in a sample/population, it can sometimes 
be useful to express variability in reference to the size of the mean of a distribution. By doing 
the latter, one can compare the values of the standard deviations and variances of distributions 
that have dramatically different mean values and/or employ different units of measurement. The 
coefficient of variation (represented by the notation CV) allows one to do this. The coefficient 
of variation is computed with Equation I.9. 


$$CV = \frac{\tilde{s}}{\bar{X}} \qquad \text{(Equation I.9)}$$


The following should be noted with respect to Equation I.9: a) When the values of $\sigma$ and 
$\mu$ are known, they can be employed in place of $\tilde{s}$ and $\bar{X}$; and b) Sometimes the value computed 
for CV is multiplied by 100 in order to express it as a percentage. 

Note that the coefficient of variation is nothing more than a ratio of the value of the 
standard deviation relative to the value of the mean. The larger the value of CV computed for 
a variable, the greater the degree of variability there is on that variable. Unlike the standard 
deviation and variance, the numerical value represented by CV is not in the units that are 
employed to measure the variable for which it is computed. 

To illustrate the latter, let us assume that we wish to compare the variability of income between 
two countries that employ dramatically different values of currency. The mean monthly 
income in Country A is $\bar{X}_A$ = 40 jaspars, with a standard deviation of $\tilde{s}_A$ = 10 jaspars. The 
mean monthly income in Country B is $\bar{X}_B$ = 2000 rocs, with a standard deviation of $\tilde{s}_B$ = 100 
rocs. Note that the mean and standard deviation for each country are expressed in the unit of currency 
employed in that country. When we employ Equation I.9, we compute that the coefficients 
of variation for the two countries are $CV_A$ = 10/40 = .25 and $CV_B$ = 100/2000 = .05. The 
latter CV values are just simple ratios, and are not numbers based on the scale for the unit of 
currency employed in a given country. In other words, $CV_A$ = .25 is not .25 jaspars, but is simply 
the ratio .25. In the same respect $CV_B$ = .05 is not .05 rocs, but is simply the ratio .05. 
Consequently, by dividing the larger value $CV_A$ = .25 by the smaller value $CV_B$ = .05 we can 
determine that there is five times as much variability in income in Country A as there is in Country 
B (i.e., $CV_A/CV_B$ = .25/.05 = 5). If we express our result as a percentage, we can say that the 
variability in income in Country A is 5 x 100% = 500% of that in Country B. If, on 
the other hand, we had divided $\tilde{s}_B$ = 100 rocs by $\tilde{s}_A$ = 10 jaspars (i.e., $\tilde{s}_B/\tilde{s}_A$ = 100/10 = 10), 
we would have erroneously concluded that there is ten times as much variability in income 
in Country B as in Country A (i.e., 10 x 100% = 1000%). The reason why the latter method results 
in a misleading conclusion is that, unlike the coefficient of variation, it fails to take into account the 
different units of currency employed in the two countries. 
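
The comparison of the two countries can be reproduced with a brief Python sketch. The function name `coefficient_of_variation` and the variable names are assumptions made for this illustration; the income figures are those given in the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed relative to the mean (Equation I.9)."""
    return std_dev / mean


cv_a = coefficient_of_variation(std_dev=10, mean=40)      # Country A, in jaspars
cv_b = coefficient_of_variation(std_dev=100, mean=2000)   # Country B, in rocs

# The units cancel in each ratio, so the two values can be compared directly.
print(round(cv_a, 2), round(cv_b, 2), round(cv_a / cv_b, 2))

# Dividing the raw standard deviations ignores the difference in currency units.
print(100 / 10)
```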


Measures of Skewness and Kurtosis 


In addition to the mean and variance, there are two other measures that can provide useful 
descriptive information about a distribution. These two measures, skewness and kurtosis, represent 
the third and fourth moments of a distribution. The word moment is employed to 
represent the sum of the deviations from the mean, each raised to a specified power, divided by 
the number of scores. Equations I.10 and I.11 respectively represent the general equation for 
a moment of a population and of a sample. In Equation I.10, $\nu_i$ (where $\nu$ 
represents the lower case Greek letter nu) represents the population parameter for the $i^{th}$ moment 
about the mean, whereas in Equation I.11, $m_i$ represents the sample statistic for the $i^{th}$ moment 
about the mean. 


$$\nu_i = \frac{\Sigma (X - \mu)^i}{N} \qquad \text{(Equation I.10)}$$

$$m_i = \frac{\Sigma (X - \bar{X})^i}{n} \qquad \text{(Equation I.11)}$$

With respect to a sample, the first moment about the mean ($m_1$) is represented by 
Equation I.12. The second moment about the mean ($m_2$, which is the sample variance) is 
represented by Equation I.13. The third moment about the mean ($m_3$, which as noted above 
represents skewness, and is also referred to as symmetry) is represented by Equation I.14. The 
fourth moment about the mean ($m_4$, which as noted above represents kurtosis) is represented 
by Equation I.15. 


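$$m_1 = \frac{\Sigma (X - \bar{X})}{n} = 0 \qquad \text{(Equation I.12)}$$

$$m_2 = \frac{\Sigma (X - \bar{X})^2}{n} \qquad \text{(Equation I.13)}$$

$$m_3 = \frac{\Sigma (X - \bar{X})^3}{n} \qquad \text{(Equation I.14)}$$

$$m_4 = \frac{\Sigma (X - \bar{X})^4}{n} \qquad \text{(Equation I.15)}$$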
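
A single function covers all four sample moments, since they differ only in the power to which each deviation from the mean is raised. The sketch below is illustrative; the name `sample_moment` is an assumption, and the data are the five scores 6, 8, 9, 13, 16 used earlier in this section.

```python
def sample_moment(scores, i):
    """i-th sample moment about the mean: sum of (X - mean)**i divided by n."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** i for x in scores) / n


scores = [6, 8, 9, 13, 16]
for i in (1, 2, 3, 4):
    print(i, round(sample_moment(scores, i), 3))
```

The first moment is zero apart from floating-point rounding, because deviations from the mean always sum to zero, and the second moment reproduces the variance computed with a denominator of n.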

Although skewness and kurtosis are not employed for descriptive purposes as frequently 
as the mean and variance, they can provide useful information. Skewness and kurtosis are some- 
times employed within the context of determining the goodness-of-fit of data in reference to a 
specific type of distribution — most commonly the normal distribution. Tests of goodness-of-fit 
are discussed under the single-sample test for evaluating population skewness (Test 4), the 
single-sample test for evaluating population kurtosis (Test 5), the Kolmogorov-Smirnov 
goodness-of-fit test for a single sample (Test 7) and the chi-square goodness-of-fit test 
(Test 8). 

Skewness Skewness is a measure reflecting the degree to which a distribution is asym- 
metrical. A symmetrical distribution will result in two identical mirror images when it is split 
down the middle. The bell shaped or normal distribution, which will be discussed in the next 
section, is the best known example of a symmetrical distribution. When a distribution is not sym- 
metrical, a disproportionate number of scores will fall either to the left or right of the middle of 
the distribution. Figure I.1 depicts three frequency distributions, only one of which, Distribution 
A, is symmetrical. Distributions B and C are asymmetrical. 

At this point in the discussion it will be useful to review some basic material concerning 
frequency distributions. As noted above, Figure I.1 depicts a graph of each of three frequency 
distributions — specifically that of the symmetrical bell-shaped distribution and two skewed 
distributions. Note that a graph of a frequency distribution is comprised of two axes, a horizontal 
axis and a vertical axis. The X-axis or horizontal axis (which is often referred to as the abscissa) is 
employed to record the range of possible scores on a variable. The Y-axis or vertical axis (which is 
often referred to as the ordinate) is employed to represent the frequency (f) with which each of the 
scores noted on the X-axis occurs in the population or sample. Graphs of frequency distributions that 
are comprised of a single line/curve, such as the distributions in Figure I.1, are often referred to as 
frequency polygons. A frequency polygon, which is a series of lines connecting the different points 
in the distribution, is generally the end result of plotting a frequency distribution for a sample. When 
all the lines connecting the points are “smoothed over,” the resulting frequency distribution assumes 
the appearance of the distributions depicted in Figure I.1. A theoretical frequency distribution (or 
as it is sometimes called, a theoretical probability distribution), which any of the distributions in 
Figure I.1 could represent, is a graph of the frequencies for a population distribution. The X-axis 
represents the range of possible scores on a variable in the population, while the Y-axis represents 
the frequency with which each of the scores occurs (or sometimes the proportion/probability of 
occurrence for the scores is recorded on the Y-axis — thus the use of the term theoretical 
probability distribution). 

Before discussing the frequency distributions depicted in Figure I.1 in greater detail, it 
should be noted that there are various other methods which are employed for graphing data. 
Although these methods will not be discussed in this book (graphical methods are discussed in 
detail in most introductory books dealing with statistics), it is worth mentioning that it is 
generally a good idea for a researcher to obtain a plot (i.e., graph) of one's data prior to 
evaluating it. The reason for the latter is that a body of data may have certain characteristics 
which may be important in determining the most appropriate method of analysis. Often such 
characteristics will not be apparent to a researcher purely on the basis of cursory visual inspection 
— especially if the researcher is a novice or if there is a large amount of data. 

Returning to Figure I.1, Distribution A is a unimodal symmetrical distribution. Although 
it is possible to have a symmetrical distribution that is multimodal (i.e., a distribution that has more 
than one mode), within the framework of the discussion to follow it will be assumed that all 
of the distributions discussed are unimodal. Note that the number of scores in the left and right 
tails of Distribution A is identical. The tail of a distribution refers to the upper/right and lower/left 
extremes of the distribution. When one tail is heavier than another tail it means that a greater 
proportion of the scores fall in that tail. In Distribution A the two tails are equally weighted. 

[Figure I.1 Symmetrical and Asymmetrical Distributions. Distribution A (symmetrical distribution): the mean, median, and mode coincide. Distribution B (negatively skewed distribution): from left to right, the mean, median, and mode. Distribution C (positively skewed distribution): from left to right, the mode, median, and mean.] 

Turning to the other two distributions, we can state that Distribution B is negatively 
skewed (or as it is sometimes called, skewed to the left) while Distribution C is positively 
skewed (or as it is sometimes called, skewed to the right). Note that in Distribution B the bulk 
of the scores fall in the right end of the distribution. This is the case, since the “hump” or upper 
part of the distribution falls to the right. The tail or lower end of the distribution is on the left 
side (thus the term skewed to the left). Distribution C, on the other hand, is positively skewed, 
since the bulk of the scores fall in the left end of the distribution. This is the case, since the 
“hump” or upper part of the distribution falls to the left. The tail or lower end of the distribution 
is on the right (thus the term skewed to the right). It should be pointed out that Distributions 
B and C represent extreme examples of skewed distributions. Thus, distributions can be 
characterized by skewness, yet not have the imbalance between the left and right tails/extremes 
that is depicted for Distributions B and C. 

As a general rule, based on whether a distribution is symmetrical, skewed negatively, or 
skewed positively, one can make a determination with respect to the relative magnitude of the 
three measures of central tendency discussed earlier in the Introduction. In a perfectly sym- 
metrical unimodal distribution the mean, median, and mode will always be the same value. In 
a skewed distribution the mean, median, and mode will not be the same value. Typically 
(although there are exceptions), in a negatively skewed distribution, the mean is the lowest value 
followed by the median and then the mode, which is the highest value. The reverse is the case 
in a positively skewed distribution, where the mean is the highest value followed by the median, 
with the mode being the lowest value. The easiest way to remember the arrangement of the three 
measures of central tendency in a skewed distribution is that they are arranged alphabetically 
moving in from the tail of the distribution to the highest point in the distribution. 

Since a measure of central tendency is supposed to reflect the most representative score for 
the distribution (although the word “tendency” implies that it may not be limited to a single 
value), the specific measure of central tendency that is employed for descriptive or inferential 
purposes should be a function of the shape of a distribution. In the case of a unimodal 
distribution that is perfectly symmetrical, the mean (which will always be the same value as the 
median and mode) will be the best measure of central tendency to use, since it employs the most 
information. When a distribution is skewed, it is often preferable to employ the median as the 
measure of central tendency in lieu of the mean. Other circumstances where it may be more 
desirable to employ the median rather than the mean as a measure of central tendency are 
discussed in Section VI of the t test for two independent samples under the discussion of 
outliers and data transformation. 

A simple method of estimating skewness for a sample is to compute the value sk, which 
represents the Pearsonian coefficient of skewness (developed in the 1890s by the English 
statistician Karl Pearson). Equation I.16 is employed to compute the value of sk, which is 
computed to be sk = 1.04 for the distribution summarized in Table I.1. The notation M in 
Equation I.16 represents the median of the sample. 


$$sk = \frac{3(\bar{X} - M)}{\tilde{s}} = \frac{3(10.4 - 9)}{4.04} = 1.04 \qquad \text{(Equation I.16)}$$


The value of sk will fall within the range -3 to +3, with a value of 0 associated with a 
perfectly symmetrical distribution. Note that when $\bar{X} > M$, sk will be a positive value, and the
larger the value of sk, the greater the degree of positive skew. When $\bar{X} < M$, sk will be a
negative value, and the larger the absolute value of sk, the greater the degree of negative skew.
Note that when $\bar{X} = M$, which will be true if a distribution is symmetrical, sk = 0.

To illustrate the above, consider the following three distributions A, B, and C, each of 
which is comprised of 10 scores. Distribution A is symmetrical, Distribution B is negatively 
skewed, and Distribution C is positively skewed. The value of sk is computed for each dis- 
tribution. 


1) Distribution A: 0, 0, 0, 5, 5, 5, 5, 10, 10, 10 


The following sample statistics can be computed for Distribution A: $\bar{X}_A = 5$; $M_A = 5$;
$\hat{s}_A = 4.08$; $sk_A = [3(5 - 5)]/4.08 = 0$. The value $sk_A = 0$ indicates that Distribution A is
symmetrical. Consistent with the fact that it is symmetrical is that the values of the mean and 
median are equal. In addition, since the scores are distributed evenly throughout the distribution, 
both tails/extremes are identical in appearance. 


2) Distribution B: 0, 1, 1, 9, 9, 10, 10, 10, 10, 10 


The following sample statistics can be computed for Distribution B: $\bar{X}_B = 7$; $M_B = 9.5$;
$\hat{s}_B = 4.40$; $sk_B = [3(7 - 9.5)]/4.40 = -1.70$. The negative value $sk_B = -1.70$ indicates that
Distribution B is negatively skewed. Consistent with the fact that it is negatively skewed is 
that the value of the mean is less than the value of the median. In addition, the majority of the 
scores (i.e., the hump) fall in the right/upper end of the distribution. The lower end of the 
distribution is the tail on the left side. 


3) Distribution C: 0, 0, 0, 0, 0, 1, 1, 9, 9, 10


The following sample statistics can be computed for Distribution C: $\bar{X}_C = 3$; $M_C = .5$;
$\hat{s}_C = 4.40$; $sk_C = [3(3 - .5)]/4.40 = 1.70$. The positive value $sk_C = 1.70$ indicates that
Distribution C is positively skewed. Consistent with the fact that it is positively skewed is that
the value of the mean is greater than the value of the median. In addition, the majority of the
scores (i.e., the hump) fall in the left/lower end of the distribution. The upper end of the
distribution is the tail on the right side.
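The computation of sk for Distributions A, B, and C can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed procedure: the function name pearson_skewness is introduced here for convenience, and the standard library's stdev (which divides by n - 1) is assumed to correspond to the standard deviation used in Equation I.16.

```python
from statistics import mean, median, stdev


def pearson_skewness(scores):
    """Pearsonian coefficient of skewness: sk = 3 * (mean - median) / s.

    stdev() divides by n - 1, i.e. it is the estimated population
    standard deviation (assumed to match the value used in the text).
    """
    return 3 * (mean(scores) - median(scores)) / stdev(scores)


distributions = {
    "A": [0, 0, 0, 5, 5, 5, 5, 10, 10, 10],
    "B": [0, 1, 1, 9, 9, 10, 10, 10, 10, 10],
    "C": [0, 0, 0, 0, 0, 1, 1, 9, 9, 10],
}

for name, scores in distributions.items():
    print(f"sk_{name} = {pearson_skewness(scores):.2f}")
```

Because the sketch carries the standard deviation at full precision, the values for Distributions B and C come out at roughly -1.71 and 1.71 rather than the -1.70 and 1.70 obtained above, where the standard deviation is rounded to 4.40 before dividing.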

The most precise measure of skewness employs the exact value of the third moment about 
the mean, designated earlier as $m_3$. Cohen (1996) and Zar (1999) note that the unbiased estimate
of the population parameter estimated by $m_3$ can be computed with either Equation I.17 (which
is the definitional equation) or Equation I.18 (which is a computational equation). 


$$m_3 = \frac{n\sum(X - \bar{X})^3}{(n - 1)(n - 2)}$$ (Equation I.17)

$$m_3 = \frac{n\sum X^3 - 3\sum X \sum X^2 + \frac{2(\sum X)^3}{n}}{(n - 1)(n - 2)}$$ (Equation I.18)


Note that in Equation I.17, the notation $\sum(X - \bar{X})^3$ indicates that the mean is subtracted
from each of the n scores in the distribution, each difference score is cubed, and the n cubed
difference scores are summed. The notation $\sum X^3$ in Equation I.18 indicates each of the n scores
is cubed, and the n cubed scores are summed. The notation $(\sum X)^3$ in Equation I.18 indicates that
the n scores are summed, and the resulting value is cubed. Note that the minimum sample size
required to compute skewness is n = 3, since any lower value will result in a zero in the
denominators of Equations I.17 and I.18, rendering them insoluble.
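The equivalence of the definitional and computational equations can be checked numerically. The following Python sketch is offered only as an illustration; the function names m3_definitional and m3_computational are hypothetical, and the data are the ten scores of Distribution B.

```python
def m3_definitional(scores):
    """Equation I.17: n * sum((X - mean)^3) / ((n - 1)(n - 2))."""
    n = len(scores)
    if n < 3:
        raise ValueError("skewness requires n >= 3")
    x_bar = sum(scores) / n
    return n * sum((x - x_bar) ** 3 for x in scores) / ((n - 1) * (n - 2))


def m3_computational(scores):
    """Equation I.18: the same quantity from the raw sums of X, X^2 and X^3."""
    n = len(scores)
    if n < 3:
        raise ValueError("skewness requires n >= 3")
    sum_x = sum(scores)
    sum_x2 = sum(x ** 2 for x in scores)
    sum_x3 = sum(x ** 3 for x in scores)
    numerator = n * sum_x3 - 3 * sum_x * sum_x2 + 2 * sum_x ** 3 / n
    return numerator / ((n - 1) * (n - 2))


distribution_b = [0, 1, 1, 9, 9, 10, 10, 10, 10, 10]
print(round(m3_definitional(distribution_b), 2))
print(round(m3_computational(distribution_b), 2))
```

Both functions refuse samples with fewer than three scores, which mirrors the zero denominator noted above.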

Since the value computed for $m_3$ is in cubed units, the unitless statistic $g_1$, which is an
estimate of the population parameter $\gamma_1$ (where $\gamma$ represents the lower case Greek letter gamma), is
commonly employed to express skewness. The value of $g_1$ is computed with Equation I.19.


$$g_1 = \frac{m_3}{\hat{s}^3}$$ (Equation I.19)

When a distribution is symmetrical (about the mean), the value of $g_1$ will equal 0. When
the value of $g_1$ is significantly above 0, a distribution will be positively skewed, and when it is
significantly below 0, a distribution will be negatively skewed. Although the normal distribution
is symmetrical (with $g_1 = 0$), as noted earlier, not all symmetrical distributions are normal.
Examples of nonnormal distributions that are symmetrical are the t distribution and the binomial
distribution, when $\pi_1 = .5$ (the meaning of the notation $\pi_1 = .5$ is explained in Section I of the
binomial sign test for a single sample (Test 9)).

Zar (1999) notes that a population parameter designated $\sqrt{\beta_1}$ (where $\beta$ represents the lower
case Greek letter beta) is employed by some sources (e.g., D'Agostino (1970, 1986) and
D'Agostino et al. (1990)) to represent skewness. Equation I.20 is used to compute $\sqrt{b_1}$, which
is the sample statistic employed to estimate the value of $\sqrt{\beta_1}$.


$$\sqrt{b_1} = \frac{(n - 2)g_1}{\sqrt{n(n - 1)}}$$ (Equation I.20)


When a distribution is symmetrical, the value of $\sqrt{b_1}$ will equal 0. When the value of $\sqrt{b_1}$
is significantly above 0, a distribution will be positively skewed, and when it is significantly
below 0, a distribution will be negatively skewed. The method for determining whether a $g_1$
and/or $\sqrt{b_1}$ value deviates significantly from 0 is described under the single-sample test for
evaluating population skewness. The results of the latter test, along with the results of the
single-sample test for evaluating population kurtosis, are used in the D'Agostino-Pearson test of
normality (Test 5a), which is employed to assess goodness-of-fit for normality (i.e., whether
sample data are likely to have been derived from a normal distribution). When a distribution is
normal, both $g_1$ and $\sqrt{b_1}$ will equal 0.

At this point employing Equations I.17/I.18, I.19, and I.20, the values of $m_3$, $g_1$, and $\sqrt{b_1}$
will be computed for Distributions A, B, and C discussed earlier in this section. Tables I.2-I.4
summarize the computations, with the following resulting values: $m_{3_A} = 0$, $m_{3_B} = -86.67$,
$m_{3_C} = 86.67$; $g_{1_A} = 0$, $g_{1_B} = -1.02$, $g_{1_C} = 1.02$; and $\sqrt{b_{1_A}} = 0$, $\sqrt{b_{1_B}} = -.86$,
$\sqrt{b_{1_C}} = .86$.
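The chain of computations from $m_3$ to $g_1$ to $\sqrt{b_1}$ can be reproduced with the short Python sketch below. It is an illustration under stated assumptions: the function name skewness_statistics is hypothetical, and the standard library's stdev (n - 1 in the denominator) is assumed to be the standard deviation intended in Equation I.19.

```python
from math import sqrt
from statistics import mean, stdev


def skewness_statistics(scores):
    """Return (m3, g1, sqrt_b1) following Equations I.17, I.19 and I.20."""
    n = len(scores)
    x_bar = mean(scores)
    # Equation I.17: unbiased estimate of the third moment about the mean
    m3 = n * sum((x - x_bar) ** 3 for x in scores) / ((n - 1) * (n - 2))
    # Equation I.19: divide by the cubed standard deviation to remove the units
    g1 = m3 / stdev(scores) ** 3
    # Equation I.20: convert g1 to the sample statistic sqrt(b1)
    sqrt_b1 = (n - 2) * g1 / sqrt(n * (n - 1))
    return m3, g1, sqrt_b1


for name, scores in {
    "A": [0, 0, 0, 5, 5, 5, 5, 10, 10, 10],
    "B": [0, 1, 1, 9, 9, 10, 10, 10, 10, 10],
    "C": [0, 0, 0, 0, 0, 1, 1, 9, 9, 10],
}.items():
    m3, g1, sqrt_b1 = skewness_statistics(scores)
    print(f"{name}: m3 = {m3:.2f}, g1 = {g1:.2f}, sqrt(b1) = {sqrt_b1:.2f}")
```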


Table I.2 Computation of Skewness for Distribution A

   X     X²      X³     X̄    (X - X̄)   (X - X̄)²   (X - X̄)³
   0      0       0     5      -5         25        -125
   0      0       0     5      -5         25        -125
   0      0       0     5      -5         25        -125
   5     25     125     5       0          0           0
   5     25     125     5       0          0           0
   5     25     125     5       0          0           0
   5     25     125     5       0          0           0
  10    100    1000     5       5         25         125
  10    100    1000     5       5         25         125
  10    100    1000     5       5         25         125

Sums: $\sum X = 50$, $\sum X^2 = 400$, $\sum X^3 = 3500$, $\sum(X - \bar{X}) = 0$, $\sum(X - \bar{X})^2 = 150$, $\sum(X - \bar{X})^3 = 0$

$$\bar{X}_A = \frac{\sum X}{n} = \frac{50}{10} = 5 \qquad \hat{s}_A = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{400 - \frac{(50)^2}{10}}{10 - 1}} = 4.08$$

$$m_{3_A} = \frac{n\sum(X - \bar{X})^3}{(n - 1)(n - 2)} = \frac{(10)(0)}{(10 - 1)(10 - 2)} = 0$$

$$m_{3_A} = \frac{n\sum X^3 - 3\sum X \sum X^2 + \frac{2(\sum X)^3}{n}}{(n - 1)(n - 2)} = \frac{(10)(3500) - (3)(50)(400) + \frac{2(50)^3}{10}}{(10 - 1)(10 - 2)} = 0$$





Kurtosis According to D'Agostino et al. (1990), the word kurtosis means curvature. 
Kurtosis is generally defined as a measure reflecting the degree to which a distribution is peaked. 
To be more specific, kurtosis provides information regarding the height of a distribution relative 
to the value of its standard deviation. The most common reason for measuring kurtosis is to 
determine whether data are derived from a normally distributed population. Kurtosis is often 
described within the framework of the following three general categories, all of which are de- 
picted by representative frequency distributions in Figure I.2: mesokurtic, leptokurtic, and 
platykurtic. 

A mesokurtic distribution, which has a degree of peakedness that is considered moderate, 
is represented by a normal distribution (i.e., the classic bell-shaped curve), which is depicted in 
Figure I.3. All normal distributions are mesokurtic, and the weight/thickness of the tails of a 
normal distribution is in between the weight/thickness of the tails of distributions that are lepto- 
kurtic or platykurtic. In Figure I.2, Distribution D best approximates a mesokurtic distribution. 

Table I.3 Computation of Skewness for Distribution B

   X     X²      X³     X̄    (X - X̄)   (X - X̄)²   (X - X̄)³
   0      0       0     7      -7         49        -343
   1      1       1     7      -6         36        -216
   1      1       1     7      -6         36        -216
   9     81     729     7       2          4           8
   9     81     729     7       2          4           8
  10    100    1000     7       3          9          27
  10    100    1000     7       3          9          27
  10    100    1000     7       3          9          27
  10    100    1000     7       3          9          27
  10    100    1000     7       3          9          27

Sums: $\sum X = 70$, $\sum X^2 = 664$, $\sum X^3 = 6460$, $\sum(X - \bar{X}) = 0$, $\sum(X - \bar{X})^2 = 174$, $\sum(X - \bar{X})^3 = -624$

$$m_{3_B} = \frac{n\sum(X - \bar{X})^3}{(n - 1)(n - 2)} = \frac{(10)(-624)}{(10 - 1)(10 - 2)} = -86.67$$

$$m_{3_B} = \frac{n\sum X^3 - 3\sum X \sum X^2 + \frac{2(\sum X)^3}{n}}{(n - 1)(n - 2)} = \frac{(10)(6460) - (3)(70)(664) + \frac{2(70)^3}{10}}{(10 - 1)(10 - 2)} = -86.67$$





A leptokurtic distribution is characterized by a high degree of peakedness. The scores in 
a leptokurtic distribution tend to be clustered much more closely around the mean than they are 
in either a mesokurtic or platykurtic distribution. Because of the latter, the value of the standard 
deviation for a leptokurtic distribution will be smaller than the standard deviation for the latter 
two distributions (if we assume the range of scores in all three distributions is approximately the 
same). The tails of a leptokurtic distribution are heavier/thicker than the tails of a mesokurtic
distribution. In Figure I.2, Distribution E best approximates a leptokurtic distribution.

A platykurtic distribution is characterized by a low degree of peakedness. The scores in 
a platykurtic distribution tend to be spread out more from the mean than they are in either a
mesokurtic or leptokurtic distribution. Because of the latter, the value of the standard deviation 
for a platykurtic distribution will be larger than the standard deviation for the latter two distri- 
butions (if we assume the range of scores in all three distributions is approximately the same). 
The tails of a platykurtic distribution are lighter/thinner than the tails of a mesokurtic distribution. 
In Figure I.2, Distribution F best approximates a platykurtic distribution. 

Moors (1986) defines kurtosis as the degree of dispersion between the points marked off on
the abscissa (X-axis) that correspond to $\mu \pm \sigma$. Thus, with respect to the three types of
distributions, we can make the statement that the range of values on the abscissa that fall between the
population mean ($\mu$) and one standard deviation above and below the mean will be greatest for a
platykurtic distribution and smallest for a leptokurtic distribution, with a mesokurtic distribution
being in the middle. As will be noted later in the Introduction, in the case of a normal
distribution (which, as noted earlier, will always be mesokurtic), approximately 68% of the scores will
always fall between the mean and one standard deviation above and below the mean.
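The figure of approximately 68% can be confirmed directly from the normal curve. The brief Python sketch below is an illustration that relies on the standard library class NormalDist; the variable names are introduced here for convenience.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of cases between one standard deviation below and above the mean
proportion_within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
print(f"{proportion_within_one_sd:.4f}")
```

The same proportion is obtained for any normal distribution, since the interval is expressed in standard deviation units.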

Figure I.2 Representative Types of Kurtosis


Table I.4 Computation of Skewness for Distribution C

   X     X²      X³     X̄    (X - X̄)   (X - X̄)²   (X - X̄)³
   0      0       0     3      -3          9         -27
   0      0       0     3      -3          9         -27
   0      0       0     3      -3          9         -27
   0      0       0     3      -3          9         -27
   0      0       0     3      -3          9         -27
   1      1       1     3      -2          4          -8
   1      1       1     3      -2          4          -8
   9     81     729     3       6         36         216
   9     81     729     3       6         36         216
  10    100    1000     3       7         49         343

Sums: $\sum X = 30$, $\sum X^2 = 264$, $\sum X^3 = 2460$, $\sum(X - \bar{X}) = 0$, $\sum(X - \bar{X})^2 = 174$, $\sum(X - \bar{X})^3 = 624$

One crude way of estimating kurtosis is that if the standard deviation of a unimodal
symmetrical distribution is approximately one-sixth the value of the range of the distribution, the
distribution is mesokurtic. In the case of a leptokurtic distribution, the standard deviation will 
be substantially less than one-sixth of the range, while in the case of a platykurtic distribution the 
standard deviation will be substantially greater than one-sixth of the range. To illustrate, let us 
assume that the range of values on an IQ test administered to a large sample is 90 points (e.g., 
the IQ scores fall in the range 55 to 145). If the standard deviation for the sample equals 15, the 
distribution would be mesokurtic (since 15/90 = 1/6). If the standard deviation equals 5, the 
distribution would be leptokurtic (since 5/90 = 1/18, which is substantially less than 1/6). If the
standard deviation equals 30, the distribution would be platykurtic (since 30/90 = 1/3, which is 
substantially greater than 1/6). 

A number of alternative measures for kurtosis have been developed, including one
developed by Moors (1988) and described in Zar (1999). The latter measure computes kurtosis by
employing specific quantile values in the distribution. The most precise measure of kurtosis,
however, employs the exact value of the fourth moment about the mean, designated earlier as $m_4$.
Cohen (1996) and Zar (1999) note that the unbiased estimate of the population parameter
estimated by $m_4$ can be computed with either Equation I.21 (which is the definitional equation)
or Equation I.22 (which is a computational equation).


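$$m_4 = \frac{\left[\frac{\sum(X - \bar{X})^4 \, n(n + 1)}{n - 1}\right] - 3\left[\sum(X - \bar{X})^2\right]^2}{(n - 2)(n - 3)}$$ (Equation I.21)

$$m_4 = \frac{(n^3 + n^2)\sum X^4 - 4(n^2 + n)\sum X^3 \sum X - 3(n^2 - n)(\sum X^2)^2 + 12n\sum X^2 (\sum X)^2 - 6(\sum X)^4}{n(n - 1)(n - 2)(n - 3)}$$ (Equation I.22)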


Note that in Equation I.21, the notation $\sum(X - \bar{X})^4$ indicates that the mean is subtracted
from each of the n scores in the distribution, each difference score is raised to the fourth power,
and the n difference scores raised to the fourth power are summed. The notation $\sum X^4$ in
Equation I.22 indicates each of the n scores is raised to the fourth power, and the n resulting values
are summed. The notation $(\sum X)^4$ in Equation I.22 indicates that the n scores are summed, and
the resulting value is raised to the fourth power. Note that the minimum sample size required to 
compute kurtosis is n = 4, since any lower value will result in a zero in the denominators of 
Equations I.21 and I.22, rendering them insoluble. 

Since the value computed for $m_4$ is in units of the fourth power, the unitless statistic $g_2$,
which is an estimate of the population parameter $\gamma_2$, is commonly employed to express kurtosis.
The value of $g_2$ is computed with Equation I.23.


$$g_2 = \frac{m_4}{\hat{s}^4}$$ (Equation I.23)

When a distribution is mesokurtic the value of $g_2$ will equal 0. When the value of $g_2$ is
significantly above 0, a distribution will be leptokurtic, and when it is significantly below 0, a
distribution will be platykurtic.

Zar (1999) notes that a population parameter designated $\beta_2$ is employed by some sources
(e.g., Anscombe and Glynn (1983), D'Agostino (1986), and D'Agostino et al. (1990)) to
represent kurtosis. Equation I.24 is used to compute $b_2$, which is the sample statistic employed to
estimate the value of $\beta_2$.

$$b_2 = \frac{(n - 2)(n - 3)g_2}{(n + 1)(n - 1)} + \frac{3(n - 1)}{n + 1}$$ (Equation I.24)


When a distribution is mesokurtic, the value of $b_2$ will equal $[3(n - 1)]/(n + 1)$. Inspection
of the latter equation reveals that as the value of the sample size increases, the value of $b_2$
approaches 3. When the value computed for $b_2$ is significantly below $[3(n - 1)]/(n + 1)$, a
distribution will be platykurtic. When the value computed for $b_2$ is significantly greater than
$[3(n - 1)]/(n + 1)$, a distribution will be leptokurtic. The method for determining whether a $g_2$
and/or $b_2$ value is statistically significant is described under the single-sample test for
evaluating population kurtosis (the concept of statistical significance is discussed in the latter part
of the Introduction). The results of the latter test, along with the results of the single-sample
test for evaluating population skewness, are used in the D'Agostino-Pearson test of
normality, which is employed to assess goodness-of-fit for normality. As noted earlier, a normal
distribution will always be mesokurtic, with $g_2 = 0$ and $b_2 = 3$.
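The statement that the mesokurtic value of $b_2$ approaches 3 as the sample size increases can be made concrete with a few sample sizes. The Python sketch below is illustrative only; the function name expected_b2_mesokurtic and the particular sample sizes are chosen here for demonstration.

```python
def expected_b2_mesokurtic(n):
    """Value of b2 for a mesokurtic distribution: 3(n - 1)/(n + 1)."""
    return 3 * (n - 1) / (n + 1)


# The value rises toward 3 as n grows, but never reaches it for finite n.
for n in (10, 20, 100, 1000):
    print(n, round(expected_b2_mesokurtic(n), 3))
```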

At this point employing Equations I.21/I.22, I.23, and I.24, the values of $m_4$, $g_2$, and $b_2$ will
be computed for two distributions to be designated E and F. The data for Distributions E and
F are designed (within the framework of a small sample size with n = 20) to approximate a
leptokurtic distribution and platykurtic distribution, respectively. Tables I.5 and I.6
summarize the computations, with the following resulting values: $m_{4_E} = 307.170$, $m_{4_F} = -1181.963$,
$g_{2_E} = 3.596$, $g_{2_F} = -.939$, and $b_{2_E} = 5.472$, $b_{2_F} = 1.994$.
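The kurtosis computations for Distributions E and F can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name kurtosis_statistics is hypothetical, and the fourth power of the standard deviation is taken to be the square of the unbiased variance returned by the standard library's variance function.

```python
from statistics import mean, variance


def kurtosis_statistics(scores):
    """Return (m4, g2, b2) following Equations I.21, I.23 and I.24."""
    n = len(scores)
    if n < 4:
        raise ValueError("kurtosis requires n >= 4")
    x_bar = mean(scores)
    sum_sq = sum((x - x_bar) ** 2 for x in scores)
    sum_4th = sum((x - x_bar) ** 4 for x in scores)
    # Equation I.21: unbiased estimate of the fourth moment about the mean
    m4 = (sum_4th * n * (n + 1) / (n - 1) - 3 * sum_sq ** 2) / ((n - 2) * (n - 3))
    # Equation I.23: s^4 is the squared unbiased variance
    g2 = m4 / variance(scores) ** 2
    # Equation I.24
    b2 = (n - 2) * (n - 3) * g2 / ((n + 1) * (n - 1)) + 3 * (n - 1) / (n + 1)
    return m4, g2, b2


distribution_e = [2, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 12, 12, 12, 13, 18]
distribution_f = [0, 1, 3, 3, 5, 5, 8, 8, 10, 10, 10, 10, 12, 12, 15, 15, 17, 17, 19, 20]

for name, scores in (("E", distribution_e), ("F", distribution_f)):
    m4, g2, b2 = kurtosis_statistics(scores)
    print(f"{name}: m4 = {m4:.3f}, g2 = {g2:.3f}, b2 = {b2:.3f}")
```

The values of $m_4$ agree with those reported above. For Distribution E the full-precision values of $g_2$ and $b_2$ (about 3.58 and 5.46) fall slightly below the reported 3.596 and 5.472, which appear to rest on a standard deviation rounded to 3.04 before it is raised to the fourth power.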


The Normal Distribution 


When an inferential statistical test is employed with one or more samples to draw inferences 
about one or more populations, such a test may make certain assumptions about the shape of an 
underlying population distribution. The most commonly encountered assumption in this regard 
is that a distribution is normal. When viewed from a visual perspective, the normal 
distribution (which as noted earlier is often referred to as the bell-shaped curve) is a graph of 
a frequency distribution which can be described mathematically and observed empirically 
(insofar as many variables in the real world appear to be distributed normally). The shape of the 
normal distribution is such that the closer a score is to the mean, the more frequently it occurs. 
As scores deviate more and more from the mean (i.e., become higher or lower), the more extreme
the score, the lower the frequency with which that score occurs. As noted earlier, a normal
distribution will always be symmetrical (with $\gamma_1 = g_1 = 0$ and $\sqrt{\beta_1} = \sqrt{b_1} = 0$) and mesokurtic
(with $\gamma_2 = g_2 = 0$ and $\beta_2 = b_2 = 3$).

Any normal distribution can be converted into what is referred to as the standard normal 
distribution, by assigning it a mean value of 0 (i.e., $\mu = 0$) and a standard deviation of 1 (i.e.,
$\sigma = 1$). The standard normal distribution, which is represented in Figure I.3, is employed more
frequently in inferential statistics than any other theoretical probability distribution. The use of 
the term theoretical probability distribution in this context is based on the fact that it is known 
that in the standard normal distribution (or, for that matter, any normal distribution) a certain 
proportion of cases will always fall within specific areas of the curve. As a result of this, if one 
knows how far removed a score is from the mean of the distribution, one can specify the 
proportion of cases that obtain that score, as well as the likelihood of randomly selecting a 
subject or object with that score. 

The general equation for the normal distribution is Equation I.25. 


$$Y = \frac{1}{\sigma\sqrt{2\pi}} e^{-(X - \mu)^2 / 2\sigma^2}$$ (Equation I.25)

Table I.5 Computation of Kurtosis for Distribution E

   X     X²      X³       X⁴     X̄    (X - X̄)   (X - X̄)²   (X - X̄)⁴
   2      4       8       16    10      -8         64        4096
   7     49     343     2401    10      -3          9          81
   8     64     512     4096    10      -2          4          16
   8     64     512     4096    10      -2          4          16
   8     64     512     4096    10      -2          4          16
   9     81     729     6561    10      -1          1           1
   9     81     729     6561    10      -1          1           1
   9     81     729     6561    10      -1          1           1
  10    100    1000    10000    10       0          0           0
  10    100    1000    10000    10       0          0           0
  10    100    1000    10000    10       0          0           0
  10    100    1000    10000    10       0          0           0
  11    121    1331    14641    10       1          1           1
  11    121    1331    14641    10       1          1           1
  11    121    1331    14641    10       1          1           1
  12    144    1728    20736    10       2          4          16
  12    144    1728    20736    10       2          4          16
  12    144    1728    20736    10       2          4          16
  13    169    2197    28561    10       3          9          81
  18    324    5832   104976    10       8         64        4096

Sums: $\sum X = 200$, $\sum X^2 = 2176$, $\sum X^3 = 25280$, $\sum X^4 = 314056$,
$\sum(X - \bar{X}) = 0$, $\sum(X - \bar{X})^2 = 176$, $\sum(X - \bar{X})^4 = 8456$

$$\bar{X}_E = \frac{\sum X}{n} = \frac{200}{20} = 10 \qquad \hat{s}_E = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{2176 - \frac{(200)^2}{20}}{20 - 1}} = 3.04$$

$$m_{4_E} = \frac{\left[\frac{\sum(X - \bar{X})^4 \, n(n + 1)}{n - 1}\right] - 3\left[\sum(X - \bar{X})^2\right]^2}{(n - 2)(n - 3)} = \frac{\left[\frac{(8456)(20)(20 + 1)}{20 - 1}\right] - 3(176)^2}{(20 - 2)(20 - 3)} = 307.170$$

$$m_{4_E} = \frac{(n^3 + n^2)\sum X^4 - 4(n^2 + n)\sum X^3 \sum X - 3(n^2 - n)(\sum X^2)^2 + 12n\sum X^2 (\sum X)^2 - 6(\sum X)^4}{n(n - 1)(n - 2)(n - 3)}$$

$$= \frac{[(20)^3 + (20)^2](314056) - 4[(20)^2 + 20](25280)(200) - 3[(20)^2 - 20](2176)^2 + 12(20)(2176)(200)^2 - 6(200)^4}{(20)(20 - 1)(20 - 2)(20 - 3)} = 307.170$$

$$b_{2_E} = \frac{(n - 2)(n - 3)g_{2_E}}{(n + 1)(n - 1)} + \frac{3(n - 1)}{n + 1} = \frac{(20 - 2)(20 - 3)(3.596)}{(20 + 1)(20 - 1)} + \frac{3(20 - 1)}{20 + 1} = 5.472$$

Table I.6 Computation of Kurtosis for Distribution F

   X     X²      X³       X⁴     X̄    (X - X̄)   (X - X̄)²   (X - X̄)⁴
   0      0       0        0    10     -10        100       10000
   1      1       1        1    10      -9         81        6561
   3      9      27       81    10      -7         49        2401
   3      9      27       81    10      -7         49        2401
   5     25     125      625    10      -5         25         625
   5     25     125      625    10      -5         25         625
   8     64     512     4096    10      -2          4          16
   8     64     512     4096    10      -2          4          16
  10    100    1000    10000    10       0          0           0
  10    100    1000    10000    10       0          0           0
  10    100    1000    10000    10       0          0           0
  10    100    1000    10000    10       0          0           0
  12    144    1728    20736    10       2          4          16
  12    144    1728    20736    10       2          4          16
  15    225    3375    50625    10       5         25         625
  15    225    3375    50625    10       5         25         625
  17    289    4913    83521    10       7         49        2401
  17    289    4913    83521    10       7         49        2401
  19    361    6859   130321    10       9         81        6561
  20    400    8000   160000    10      10        100       10000

Sums: $\sum X = 200$, $\sum X^2 = 2674$, $\sum X^3 = 40220$, $\sum X^4 = 649690$,
$\sum(X - \bar{X}) = 0$, $\sum(X - \bar{X})^2 = 674$, $\sum(X - \bar{X})^4 = 45290$

$$\bar{X}_F = \frac{\sum X}{n} = \frac{200}{20} = 10 \qquad \hat{s}_F = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{2674 - \frac{(200)^2}{20}}{20 - 1}} = 5.96$$

$$m_{4_F} = \frac{\left[\frac{\sum(X - \bar{X})^4 \, n(n + 1)}{n - 1}\right] - 3\left[\sum(X - \bar{X})^2\right]^2}{(n - 2)(n - 3)} = \frac{\left[\frac{(45290)(20)(20 + 1)}{20 - 1}\right] - 3(674)^2}{(20 - 2)(20 - 3)} = -1181.963$$

$$m_{4_F} = \frac{(n^3 + n^2)\sum X^4 - 4(n^2 + n)\sum X^3 \sum X - 3(n^2 - n)(\sum X^2)^2 + 12n\sum X^2 (\sum X)^2 - 6(\sum X)^4}{n(n - 1)(n - 2)(n - 3)}$$

$$= \frac{[(20)^3 + (20)^2](649690) - 4[(20)^2 + 20](40220)(200) - 3[(20)^2 - 20](2674)^2 + 12(20)(2674)(200)^2 - 6(200)^4}{(20)(20 - 1)(20 - 2)(20 - 3)} = -1181.963$$

$$b_{2_F} = \frac{(n - 2)(n - 3)g_{2_F}}{(n + 1)(n - 1)} + \frac{3(n - 1)}{n + 1} = \frac{(20 - 2)(20 - 3)(-.939)}{(20 + 1)(20 - 1)} + \frac{3(20 - 1)}{20 + 1} = 1.994$$


© 2000 by Chapman & Hall/CRC 


Figure I.3 The Standard Normal Distribution (abscissa: standard deviation scores (z scores), marked from -3 to +3)


In Equation I.25 the symbols $\mu$ and $\sigma$ represent the mean and standard deviation of a normal
distribution. For any normal distribution where the values of $\mu$ and $\sigma$ are known, a value of Y (which
represents the height of the distribution at a given point on the abscissa) can be computed simply by
substituting a specified value of X in Equation I.25. Note that in the case of the standard normal
distribution, where $\mu = 0$ and $\sigma = 1$, Equation I.25 becomes Equation I.26.


$$Y = \frac{1}{\sqrt{2\pi}} e^{-X^2/2}$$ (Equation I.26)


The reader should take note of the fact that the normal distribution is a family of
distributions which is comprised of all possible values of $\mu$ and $\sigma$ that can be substituted in Equation
I.25. Although the values of $\mu$ and $\sigma$ for a normal distribution may vary, as noted earlier, all
normal distributions are mesokurtic.
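The height of the normal curve at any point on the abscissa can be computed directly from the general equation. The Python sketch below is offered as an illustration; the function name normal_height is hypothetical, and the values 100 and 15 used in the last call are simply one member of the family of normal distributions.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_height(x, mu=0.0, sigma=1.0):
    """Height Y of the normal curve at X = x (Equation I.25).

    With the defaults mu = 0 and sigma = 1 the expression reduces
    to the standard normal form (Equation I.26).
    """
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


# Height at the mean of the standard normal curve
print(normal_height(0))
# Cross-check against the standard library's density function
print(NormalDist(mu=0, sigma=1).pdf(0))
# A different member of the family: mean 100, standard deviation 15
print(normal_height(115, mu=100, sigma=15))
```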

For any variable that is normally distributed, regardless of the values of the population 
mean and standard deviation, the distance of a score from the mean in standard deviation units 
can be computed with Equation I.27. The z value computed with Equation I.27 is a measure in
standard deviation units of how far a score is from the mean.


$$z = \frac{X - \mu}{\sigma}$$ (Equation I.27)


Where: X is a specific score
       $\mu$ is the value of the population mean
       $\sigma$ is the value of the population standard deviation


When Equation I.27 is employed, any score that is above the mean will yield a positive z 
score, and any score that is below the mean will yield a negative z score. Any score that is equal 
to the mean will yield a z score of zero. 

To illustrate this, assume we have an IQ test for which it is known that the population mean 
is $\mu = 100$ and the population standard deviation is $\sigma = 15$. Assume three people take the test and
obtain the following IQ scores: Person A: 135; Person B: 65; and Person C: 100. The z score
(standard deviation score) for each person is computed below. The reader should take note of
the fact that a z score is always computed to at least two decimal places.


Person A: $z = \frac{135 - 100}{15} = 2.33$

Person B: $z = \frac{65 - 100}{15} = -2.33$

Person C: $z = \frac{100 - 100}{15} = 0$
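The three z scores can be reproduced with the following Python sketch, which is included only as an illustration; the function name z_score and the dictionary of scores are introduced here for convenience.

```python
def z_score(x, mu, sigma):
    """Distance of a score from the mean in standard deviation units (Equation I.27)."""
    return (x - mu) / sigma


iq_scores = {"Person A": 135, "Person B": 65, "Person C": 100}

for person, iq in iq_scores.items():
    # Population mean 100 and standard deviation 15, as in the IQ example
    print(f"{person}: z = {z_score(iq, mu=100, sigma=15):.2f}")
```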


Person A obtains an IQ score that is 2.33 standard deviation units above the mean, Person 
B obtains an IQ score that is 2.33 standard deviation units below the mean, and Person C obtains 
an IQ score at the mean. If we wanted to determine the likelihood (i.e., the probability) of select- 
ing a person (as well as the proportion of people) who obtains a specific score in a normal 
distribution, Table A1 (Table of the Normal Distribution) in the Appendix can provide this 
information. Although Table A1 is comprised of four columns, for the analysis to be discussed 
in this section we will only be interested in the first three columns. 

Column 1 in Table A1 lists z scores that range in value from 0 to an absolute value of 4. 
The use of the term absolute value of 4 is based on the fact that since the normal distribution is 
symmetrical, anything we say with respect to the probability or the proportion of cases associated 
with a positive z score will also apply to the corresponding negative z score. Note that positive 
z scores will always fall to the right of the mean (often referred to as the right tail of the distribu- 
tion), thus indicating that the score is above the mean. Negative z scores, on the other hand, will 
always fall to the left of the mean (often referred to as the left tail of the distribution), thus indi- 
cating that the score is below the mean. 

Column 2 in Table A1 lists the proportion of cases (which can also be interpreted as prob- 
ability values) that falls between the mean of the distribution and the z score that appears in a 
specific row. 

Column 3 in the table lists the proportion of cases that falls beyond the z score in that row. 
More specifically, the proportion listed in Column 3 is evaluated in relation to the tail of the 
distribution in which the score appears. Thus, if a z score is positive, the value in Column 3 
will represent the proportion of cases that falls above that z score, whereas if the z score is 
negative, the value in Column 3 will represent the proportion of cases that falls below that z 
score.

Table A1 will now be employed in reference to the IQ scores of Person A and Person B. 
For both subjects the computed absolute value of z associated with their IQ score is z = 2.33. For
z = 2.33, the tabled values in Columns 2 and 3 are, respectively, .4901 and .0099. The value in
Column 2 indicates that the proportion of the population that obtains a z score between the mean
and z = 2.33 is .4901 (which expressed as a percentage is 49.01%), and the proportion of the
population which obtains a z score between the mean and z = -2.33 is .4901. We can make
comparable statements with respect to the IQ values associated with these z scores. Thus, we can say
that the proportion of the population that obtains an IQ score between 100 and 135 is .4901, and
the proportion of the population which obtains an IQ score between 65 and 100 is .4901. Since
the normal distribution is symmetrical, .5 (or 50%) represents the proportion of cases that
falls both above and below the mean. Thus, we can determine that .5 + .4901 = .9901 (or 99.01%)
is the proportion of people with an IQ of 135 or less, as well as the proportion of people who have
an IQ of 65 or greater. We can state that a person who has an IQ of 135 has a score that falls at
approximately the 99th percentile, since it is equal to or greater than the scores of 99% of the
population. On the other hand, a person who has an IQ of 65 has a score that falls at the 1st
percentile, since it is equal to or greater than the scores of only approximately 1% of the
population.

The value in Column 3 indicates that the proportion of the population that obtains a score 
of z = 2.33 or greater (and thus, in reference to Person A, an IQ of 135 or greater) is .0099 (which 
is .99%). In the same respect, the proportion of the population that obtains a score of z = -2.33
or less (and thus, in reference to Person B, an IQ of 65 or less) is .0099. 

If one interprets the values in Columns 2 and 3 as probability values instead of proportions, 
we can state that if one randomly selects a person from the population, the probability of 
selecting someone with an IQ of 135 or greater will be approximately 1%. In the same respect,
the probability of selecting someone with an IQ of 65 or less will also be approximately 1%. 
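The entries read from Columns 2 and 3 can also be obtained from the cumulative normal distribution. The Python sketch below is an illustration only: the function name table_a1_columns is hypothetical, and the standard library class NormalDist is used here as a stand-in for the printed table.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0, sigma=1)


def table_a1_columns(z):
    """Return (Column 2, Column 3) proportions for a z score.

    Column 2: proportion of cases between the mean and z.
    Column 3: proportion of cases beyond z, in the tail where z lies.
    Only the absolute value of z matters, since the curve is symmetrical.
    """
    beyond = 1 - STANDARD_NORMAL.cdf(abs(z))
    between = 0.5 - beyond
    return between, beyond


between, beyond = table_a1_columns(2.33)
print(f"Column 2: {between:.4f}  Column 3: {beyond:.4f}")
```

Interpreted as probabilities, the Column 3 value is the chance of randomly selecting a person at or beyond the score in question.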

In the case of Person C, whose IQ score of 100 results in the standard deviation score z = 0,
inspection of Table A1 reveals that the values in Columns 2 and 3 associated with z = 0 are,
respectively, .0000 and .5000. This indicates that the proportion of the population that obtains
an IQ of 100 or greater is .5 (which is equivalent to 50%), and that the proportion of the
population which obtains an IQ of 100 or less is .5. Thus, if we randomly select a person from
the population, the probability of selecting someone with an IQ equal to or greater than 100 will
be .5, and the probability of selecting someone with an IQ equal to or less than 100 will be .5.
We can also state that the score of a person who has an IQ of 100 falls at the 50th percentile,
since it is equal to or greater than the scores of 50% of the population.

Note that to determine a percentile rank associated with a positive z value (or a score that 
results in a positive z value), 50% should be added to the percentage of cases that fall between 
the mean and that z value — in other words, the entry for the z value in Column 2 expressed as 
a percentage is added to 50%. The 50% we add to the value in Column 2 represents the 
percentage of the population that scores below the mean. The percentile rank for a negative z 
value (or a score that results in a negative z value) is the entry in Column 3 for that z value 
expressed as a percentage. 
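The rule for converting a z value into a percentile rank can be sketched as follows. The Python function percentile_rank is hypothetical and is intended only to mirror the two cases just described; the Column 2 and Column 3 entries are computed from the standard library class NormalDist rather than read from Table A1.

```python
from statistics import NormalDist


def percentile_rank(z):
    """Percentile rank of a z score, following the Column 2 / Column 3 rule."""
    standard_normal = NormalDist(mu=0, sigma=1)
    if z >= 0:
        # Column 2 entry: proportion between the mean and z, added to 50%
        between_mean_and_z = standard_normal.cdf(z) - 0.5
        return 50 + 100 * between_mean_and_z
    # Column 3 entry: proportion falling below a negative z
    beyond_z = standard_normal.cdf(z)
    return 100 * beyond_z


for z in (2.33, -2.33, 0):
    print(f"z = {z:+.2f}: percentile rank = {percentile_rank(z):.2f}")
```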

Figure I.4 provides a graphic summary of the proportions discussed in the above analysis. 

Figure I.4 Summary of Normal Curve Problem


Hypothesis Testing


Inferential statistics primarily employ sample data in two ways to draw inferences about one or 
more populations. The two methodologies employed in inferential statistics are hypothesis test- 
ing and estimation of population parameters. This section will discuss hypothesis testing. 

Within the framework of inferential statistics, a hypothesis can be defined as a prediction 
about a single population or about the relationship between two or more populations. 
Hypothesis testing is a procedure in which sample data are employed to evaluate a hypothesis. 
In using the term hypothesis, some sources make a distinction between a research hypothesis 
and statistical hypotheses. 

A research hypothesis is a general statement of what a researcher predicts. Two examples 
of a research hypothesis are: a) The average IQ of all males is some value other than 100; and 
b) Clinically depressed patients who take an antidepressant for six months will be less depressed 
than clinically depressed patients who take a placebo for six months. 

In order to evaluate a research hypothesis, it is restated within the framework of two 
statistical hypotheses. Through use of a symbolic format, the statistical hypotheses summarize 
the research hypothesis with reference to the population parameter or parameters under study. 
The two statistical hypotheses are the null hypothesis, which is represented by the notation $H_0$,
and the alternative hypothesis, which is represented by the notation $H_1$.

The null hypothesis is a statement of no effect or no difference. Since the statement of 
the research hypothesis generally predicts the presence of an effect or a difference with respect 
to whatever it is that is being studied, the null hypothesis will generally be a hypothesis that the 
researcher expects to be rejected. The alternative hypothesis, on the other hand, represents a 
statistical statement indicating the presence of an effect or a difference. Since the research hy- 
pothesis typically predicts an effect or difference, the researcher generally expects the alternative 
hypothesis to be supported. 

The null and alternative hypotheses will now be discussed in reference to the two research 
hypotheses noted earlier. Within the framework of the first research hypothesis that was pre- 
sented, we will assume that a study is conducted in which an IQ score is obtained for each of n 
males who have been randomly selected from a population comprised of N males. The null and 
alternative hypotheses can be stated as follows: $H_0$: $\mu = 100$ and $H_1$: $\mu \neq 100$. The null
hypothesis states that the mean (IQ score) of the population the sample represents equals 100. The
alternative hypothesis states that the mean of the population the sample represents does not equal 
100. The absence of an effect will be indicated by the fact that the sample mean is equal to or 
reasonably close to 100. If such an outcome is obtained, a researcher can be reasonably 
confident that the sample has come from a population with a mean value of 100. The presence 
of an effect, on the other hand, will be indicated by the fact that the sample mean is significantly 
above or below the value 100. Thus, if the sample mean is substantially larger or smaller than 
100, the researcher can conclude there is a high likelihood that the population mean is some value 
other than 100, and thus reject the null hypothesis. 

As stated above, the alternative hypothesis is nondirectional. A nondirectional (also 
referred to as a two-tailed) alternative hypothesis does not make a prediction in a specific 
direction. The alternative hypothesis H₁: μ ≠ 100 just states that the population mean will not
equal 100, but it does not predict whether it will be less than or greater than 100. If, however, 
a researcher wants to make a prediction with respect to direction, the alternative hypothesis can 
also be stated directionally. Thus, with respect to the above example, either of the following two 
directional (also referred to as one-tailed) alternative hypotheses can be employed: 
H₁: μ > 100 or H₁: μ < 100.

The alternative hypothesis H₁: μ > 100 states that the mean of the population the sample
represents is some value greater than 100. If the directional alternative hypothesis H₁: μ > 100
is employed, the null hypothesis can only be rejected if the data indicate that the population mean 
is some value above 100. The null hypothesis cannot, however, be rejected if the data indicate 
that the population mean is some value below 100. 

The alternative hypothesis H₁: μ < 100 states that the mean of the population the sample
represents is some value less than 100. If the directional alternative hypothesis H₁: μ < 100 is
employed, the null hypothesis can only be rejected if the data indicate that the population mean 
is some value below 100. The null hypothesis cannot, however, be rejected if the data indicate 
that the population mean is some value above 100. The reader should take note of the fact that 
although there are three possible alternative hypotheses that one can employ (one that is non- 
directional and two that are directional), the researcher will select only one of the alternative 
hypotheses. 

Researchers are not in agreement with respect to the conditions under which one should 
employ a nondirectional or a directional alternative hypothesis. Some researchers take the posi- 
tion that a nondirectional alternative hypothesis should always be employed, regardless of one's 
prior expectations about the outcome of an experiment. Other researchers believe that a non- 
directional alternative hypothesis should only be employed when one has no prior expectations 
about the outcome of an experiment (i.e., no expectation with respect to the direction of an effect 
or difference). These same researchers believe that if one does have a definite expectation about 
the direction of an effect or difference, a directional alternative hypothesis should be employed. 
One advantage of employing a directional alternative hypothesis is that in order to reject the null 
hypothesis, a directional alternative hypothesis does not require that there be as large an effect 
or difference in the sample data as will be the case if a nondirectional alternative hypothesis is 
employed. 
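
The difference between a nondirectional and a directional alternative hypothesis can be illustrated with a short computational sketch. The Python code below is not part of the hypothesis testing model as presented here. It assumes a single-sample z test in which the population standard deviation is known, and all of the values employed (a sample mean of 103.6, σ = 15, n = 49) are hypothetical. The function name z_test_p_values is likewise an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_values(sample_mean, mu0, sigma, n):
    """Return z and the p-values for the three possible alternative hypotheses.

    Assumes the population standard deviation (sigma) is known.
    """
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    std_normal = NormalDist()  # mean 0, standard deviation 1
    p_greater = 1.0 - std_normal.cdf(z)          # H1: mu > mu0
    p_less = std_normal.cdf(z)                   # H1: mu < mu0
    p_two_tailed = 2.0 * min(p_greater, p_less)  # H1: mu != mu0
    return z, p_two_tailed, p_greater, p_less


# Hypothetical data: 49 males with a mean IQ of 103.6; sigma = 15 is assumed.
z, p_two, p_gt, p_lt = z_test_p_values(sample_mean=103.6, mu0=100, sigma=15, n=49)
print(f"z = {z:.2f}")
print(f"H1: mu != 100  p = {p_two:.4f}")
print(f"H1: mu > 100   p = {p_gt:.4f}")
print(f"H1: mu < 100   p = {p_lt:.4f}")
```

With these hypothetical values z is approximately 1.68. The directional alternative hypothesis H₁: μ > 100 yields a probability below .05, whereas the nondirectional alternative hypothesis yields a probability above .05. This is consistent with the point that a directional alternative hypothesis does not require as large a difference in the sample data in order to reject the null hypothesis.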

The second of the research hypotheses discussed earlier in this section predicted that an 
antidepressant will be more effective than a placebo in treating depression. Let us assume that
in order to evaluate this research hypothesis, a study is conducted which involves two groups of
clinically depressed patients. One group, which will represent Sample 1, is comprised of n₁
patients, and the other group, which will represent Sample 2, is comprised of n₂ patients. The
subjects in Sample 1 take an antidepressant for six months, and the subjects in Sample 2 take a 
placebo during the same period of time. After six months have elapsed, each subject is assigned 
a score with respect to his or her level of depression. 

The null and alternative hypotheses can be stated as follows: H₀: μ₁ = μ₂ and
H₁: μ₁ ≠ μ₂. The null hypothesis states that the mean (depression score) of the population
Sample 1 represents equals the mean of the population Sample 2 represents. The alternative 
hypothesis (which is stated nondirectionally) states that the mean of the population Sample 1 
represents does not equal the mean of the population Sample 2 represents. In this instance the 
two populations are a population comprised of N₁ clinically depressed people who take an
antidepressant for six months versus a population comprised of N₂ clinically depressed people who
take a placebo for six months. The absence of an effect or difference will be indicated by the fact 
that the two sample means are exactly the same value or close to being equal. If such an outcome 
is obtained, a researcher can be reasonably confident that the samples do not represent two 
different populations. The presence of an effect, on the other hand, will be indicated if a
significant difference is observed between the two sample means. Thus, we can reject the null
hypothesis if the mean of Sample 1 is significantly larger than the mean of Sample 2, or the mean 
of Sample 1 is significantly smaller than the mean of Sample 2. 

As is the case with the first research hypothesis discussed earlier, the alternative hypothesis
can also be stated directionally. Thus, either of the following two directional alternative
hypotheses can be employed: H₁: μ₁ > μ₂ or H₁: μ₁ < μ₂.

The alternative hypothesis H₁: μ₁ > μ₂ states that the mean of the population Sample 1
represents is greater than the mean of the population Sample 2 represents. If the directional
alternative hypothesis H₁: μ₁ > μ₂ is employed, the null hypothesis can only be rejected if the
data indicate that the mean of Sample 1 is significantly greater than the mean of Sample 2. The 
null hypothesis cannot, however, be rejected if the mean of Sample 1 is significantly less than 
the mean of Sample 2. 

The alternative hypothesis H₁: μ₁ < μ₂ states that the mean of the population Sample 1
represents is less than the mean of the population Sample 2 represents. If the directional
alternative hypothesis H₁: μ₁ < μ₂ is employed, the null hypothesis can only be rejected if the
data indicate that the mean of Sample 1 is significantly less than the mean of Sample 2. The null 
hypothesis cannot, however, be rejected if the mean of Sample 1 is significantly greater than the 
mean of Sample 2. 

Upon collecting the data for a study, the next step in the hypothesis testing procedure is to 
evaluate the data through use of the appropriate inferential statistical test. An inferential 
statistical test yields a test statistic. The latter value is interpreted by employing special tables 
that contain information with regard to the expected distribution of the test statistic. More 
specifically, such tables contain extreme values of the test statistic (referred to as critical values) 
that are highly unlikely to occur if the null hypothesis is true. Such tables allow a researcher to 
determine whether or not the result of a study is statistically significant. 

The term statistical significance implies that one is determining whether an obtained differ- 
ence in an experiment is due to chance or is the result of a genuine experimental effect. To 
clarify this, think of a roulette wheel on which there are 38 possible numbers that may occur on 
any roll of the wheel. Suppose we spin a wheel 38,000 times. On the basis of chance each 
number should occur 1/38th of the time, and thus each value should occur 1000 times (i.e., 38,000
÷ 38). Suppose the number 32 occurs 998 times in 38,000 spins of the wheel. Since this value
is close to the expected value of 1000, it is highly unlikely that the wheel is biased against the
number 32 (i.e., the wheel appears to be fair, at least in reference to the number 32). This is because
998 is extremely close to 1000, and a difference of 2 outcomes isn't unlikely on the basis of the 
random occurrence of events (i.e., chance). On the other hand, if the number 32 only occurs 380 
times in 38,000 trials (i.e., 1/100th of the time), since 380 is well below the expected value of
1000, this strongly suggests that the wheel is biased against the number 32 (and is thus probably 
biased in favor of one or more of the other numbers). On the basis of this, one would probably 
conclude that the wheel is defective and should be replaced. 
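
The reasoning in the roulette wheel example can be made concrete with a brief sketch. The Python code below is an illustration only. It assumes that the number of times a given value occurs on a fair wheel follows a binomial distribution, and it employs the normal approximation to that distribution (without a continuity correction) to gauge how far each observed count falls below the expected value of 1000. The function name lower_tail_probability is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def lower_tail_probability(observed, spins, p_hit):
    """Approximate P(count <= observed) for a fair wheel.

    Uses the normal approximation to the binomial distribution
    (no continuity correction), which is an assumption of this sketch.
    """
    expected = spins * p_hit
    sd = sqrt(spins * p_hit * (1.0 - p_hit))
    z = (observed - expected) / sd
    return expected, sd, z, NormalDist().cdf(z)


spins, p_hit = 38_000, 1 / 38
for observed in (998, 380):
    expected, sd, z, prob = lower_tail_probability(observed, spins, p_hit)
    print(f"observed = {observed}: expected = {expected:.0f}, "
          f"sd = {sd:.1f}, z = {z:.2f}, lower-tail probability = {prob:.4g}")
```

Under these assumptions the standard deviation of the count is approximately 31, so a count of 998 lies well within one standard deviation of the expected value, while a count of 380 lies almost 20 standard deviations below it. The latter outcome is, for all practical purposes, impossible on a fair wheel.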

When evaluating the results of an experiment, one employs a logical process similar to that 
involved in the above situation with the roulette wheel. The decision on whether to retain or 
reject the null hypothesis is based on contrasting the observed outcome of an experiment with 
the outcome one can expect if, in fact, the null hypothesis is true. This decision is made by using 
the appropriate inferential statistical test. An inferential statistical test is essentially an equation 
which describes a set of mathematical operations that are to be performed on the data obtained 
in a study. The end result of conducting such a test is a final value which is designated as the test
statistic. A test statistic is evaluated in reference to a sampling distribution, which is a theoret- 
ical probability distribution of all the possible values the test statistic can assume if one were to 
conduct an infinite number of studies employing a sample size equal to that used in the study 
being evaluated. The probabilities in a sampling distribution are based on the assumption that 
each of the samples is randomly drawn from the population it represents. 
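
Although a sampling distribution is a theoretical distribution based on an infinite number of studies, it can be approximated by simulating a large number of studies. The Python sketch below is an illustration under stated assumptions: the statistic is the sample mean, the population is assumed to be normally distributed with μ = 100 and σ = 15, and 5000 studies with n = 25 are simulated. All of these values are hypothetical.

```python
import random
from math import sqrt
from statistics import fmean, stdev

rng = random.Random(0)      # fixed seed so the sketch is repeatable
mu, sigma = 100.0, 15.0     # hypothetical population parameters
n, n_studies = 25, 5000     # sample size per study, number of simulated studies

# Each simulated study draws a random sample of size n and records its mean.
sample_means = [
    fmean([rng.gauss(mu, sigma) for _ in range(n)])
    for _ in range(n_studies)
]

print(f"mean of the sample means: {fmean(sample_means):.2f}")
print(f"standard deviation of the sample means: {stdev(sample_means):.2f}")
print(f"theoretical value sigma / sqrt(n): {sigma / sqrt(n):.2f}")
```

The simulated sample means cluster around the population mean, and their standard deviation should be close to σ/√n = 3, which is the standard deviation of the theoretical sampling distribution of the mean under these assumptions.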

When evaluating the study involving the use of a drug versus a placebo in treating depres- 
sion, one is asking if the difference between the scores of the two groups is due to chance, or if 
instead, it is due to some nonchance factor (which in a well controlled study will be the
differential treatment to which the groups are exposed). The larger the difference between the average
scores of the two groups (just like the larger the difference between the observed and expected
occurrence of a number on a roulette wheel), the less likely the difference is due to chance 
factors, and the more likely it is due to the experimental treatments. Thus, by declaring a 
difference statistically significant, the researcher is saying that based on an analysis of the 
sampling distribution of the test statistic, it is highly unlikely that a difference equal to or greater 
than that which was observed in the study could have occurred as a result of chance. In view of
this, the most logical decision is to conclude that the difference is due to the experimental 
treatments, and thus reject the null hypothesis. 

Scientific convention has established that in order to declare a difference statistically sig- 
nificant, there can be no more than a 5% likelihood that the difference is due to chance. If a 
researcher believes that 5% is too high a value, the researcher may elect to employ a 1% (or an even
lower) likelihood before being willing to conclude that a difference is significant. The
notation p > .05 is employed to indicate that the result of an experiment is not significant. This
notation indicates that there is a greater than 5% likelihood that an observed difference or effect
could be due to chance. On the other hand, the notation p < .05 indicates that the outcome of a
study is significant at the .05 level. This indicates that there is less than a 5% likelihood that
an obtained difference or effect can be due to chance. The notation p < .01 indicates a significant
result at the .01 level (i.e., there is less than a 1% likelihood that the difference is due to chance). 

When the normal distribution is employed for inferential statistical analysis, four tabled 
critical values are commonly employed. These values are summarized in Table I.7. 


Table I.7 Tabled Critical Two-Tailed and One-Tailed .05 and .01 z Values

                        .05     .01
Two-tailed values      1.96    2.58
One-tailed values      1.65    2.33


The value z = 1.96 is referred to as the tabled critical two-tailed .05 z value. This value is 
employed since the total proportion of cases in the normal distribution that falls above z = +1.96 
or below z = −1.96 is .05. This can be confirmed by examining Column 3 of Table A1 for the
value z = 1.96. The value of .025 in Column 3 indicates that the proportion of cases in the right
tail of the curve that falls above z = +1.96 is .025, and the proportion of cases in the left tail of
the curve that falls below z = −1.96 is .025. If the two .025 values are added, the resulting
proportion is .05. Note that this is a two-tailed critical value, since the proportion .05 is based 
on adding the extreme 2.5% of the cases from the two tails of the distribution. 

The value z = 2.58 is referred to as the tabled critical two-tailed .01 z value. This value is 
employed since the total proportion of cases in the normal distribution that falls above z = +2.58 
or below z = −2.58 is .01. This can be confirmed by examining Column 3 of Table A1 for the
value z = 2.58. The value of .0049 (which rounded off equals .005) in Column 3 indicates that
the proportion of cases in the right tail of the curve that falls above z = +2.58 is .0049, and the
proportion of cases in the left tail of the curve that falls below z = −2.58 is .0049. If the two
.0049 values are added, the resulting proportion is .0098, which rounded off equals .01. Note 
that this is a two-tailed critical value, since the proportion .01 is based on adding the extreme .5% 
of the cases from the two tails of the distribution. 

The value z = 1.65 is referred to as the tabled critical one-tailed .05 z value. This value is 
employed since the proportion of cases in the normal distribution that falls above z = +1.65 (or,
equivalently, below z = −1.65) in a single tail of the distribution is .05. This can be confirmed by
examining Column 3 of Table A1 for the value z = 1.65. The value of .0495 (which rounded off equals .05)
in Column 3 indicates that the proportion of cases in the right tail of the curve that falls above z
= +1.65 is .0495, and the proportion of cases in the left tail of the curve that falls below z = −1.65
is .0495. Note that this is a one-tailed critical value, since the proportion .05 is based on the
extreme 5% of the cases in one tail of the distribution.

The value z = 2.33 is referred to as the tabled critical one-tailed .01 z value. This value is 
employed since the proportion of cases in the normal distribution that falls above z = +2.33 (or,
equivalently, below z = −2.33) in a single tail of the distribution is .01. This can be confirmed by
examining Column 3 of Table A1 for the value z = 2.33. The value of .0099 (which rounded off equals .01)
in Column 3 indicates that the proportion of cases in the right tail of the curve that falls above
z = +2.33 is .0099, and the proportion of cases in the left tail of the curve that falls below z =
−2.33 is .0099. Note that this is a one-tailed critical value, since the proportion .01 is based on
the extreme 1% of the cases in one tail of the distribution. 
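
The four tabled critical values can also be obtained directly from the standard normal distribution rather than from a table. The Python sketch below, which is offered only as an illustration, employs the inverse cumulative distribution function of the standard normal distribution. The function name critical_z is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def critical_z(alpha, tails):
    """Return the positive critical z value for a one- or two-tailed test."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tailed test splits alpha equally between the two tails.
    return std_normal.inv_cdf(1.0 - alpha / tails)


for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: two-tailed z = {critical_z(alpha, 2):.3f}, "
          f"one-tailed z = {critical_z(alpha, 1):.3f}")
```

Computed to three decimal places, the values are approximately 1.960 and 2.576 (two-tailed) and 1.645 and 2.326 (one-tailed). The tabled values 2.58 and 2.33 are these figures rounded to two decimal places. The one-tailed .05 value of approximately 1.645 lies midway between 1.64 and 1.65, and the tabled value of 1.65 is the one conventionally employed.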

Although in practice it is not scrupulously adhered to, the conventional hypothesis testing 
model employed in inferential statistics assumes that prior to conducting a study a researcher 
stipulates whether a directional or nondirectional alternative hypothesis will be employed, as well 
as at what level of significance the null hypothesis will be evaluated. The probability value 
which identifies the level of significance is represented by the notation α, which is the lower case
Greek letter alpha. Throughout the book the latter value will be referred to as the prespecified 
alpha value (or prespecified level of significance), since it will be assumed that the value was 
specified prior to the data collection phase of a study. 

When one employs the term significance in the context of scientific research, it is 
instructive to make a distinction between statistical significance and practical significance. 
Statistical significance only implies that the outcome of a study is highly unlikely to have 
occurred as a result of chance. It does not necessarily suggest that any difference or effect 
detected in a set of data is of any practical value. As an example, assume that the Scholastic 
Aptitude Test (SAT) scores of two school districts that employ different teaching methodologies 
are contrasted. Assume that the teaching methodology of each school district is based on 
specially designed classrooms. The results of the study indicate that the SAT average in School 
District A is one point higher than the SAT average in School District B, and this difference is 
statistically significant at the .01 level. Common sense suggests that it would be illogical for 
School District B to invest the requisite time and money in order to redesign its physical 
environment for the purpose of increasing the average SAT score in the district by one point. 
Thus, in this example, even though the obtained difference is statistically significant, in the final 
analysis it is of little or no practical significance. The general issue of statistical versus practical 
significance is discussed in more detail in Section VI of the t test for two independent samples,
and under the discussion of meta-analysis and related topics in Section IX (the Addendum) 
of the Pearson product-moment correlation coefficient (Test 28). 


Type I and Type II errors in hypothesis testing Within the framework of hypothesis 
testing, it is possible for a researcher to commit two types of errors. These errors are referred to 
as a Type I error and a Type II error. 

A Type I error is when a true null hypothesis is rejected (i.e., one concludes that a false 
alternative hypothesis is true). The likelihood of committing a Type I error is specified by the 
alpha level a researcher employs in evaluating an experiment. The more concerned a researcher 
is with committing a Type I error, the lower the value of alpha the researcher should employ. 
Thus, the likelihood of committing a Type I error if α = .01 is 1%, as compared with a 5%
likelihood if α = .05.

A Type II error is when a false null hypothesis is retained (i.e., one concludes that a true 
alternative hypothesis is false). The likelihood of committing a Type II error is represented by
β, which (as noted earlier) is the lower case Greek letter beta. The likelihood of rejecting a false
null hypothesis represents what is known as the power of a statistical test. The power of a test
is determined by subtracting the value of beta from 1 (i.e., Power = 1 − β). The likelihood of
committing a Type II error is inversely related to the likelihood of committing a Type I error. 
In other words, as the likelihood of committing one type of error decreases, the likelihood of 
committing the other type of error increases. Thus, with respect to the alternative hypothesis one 
employs, there is a higher likelihood of committing a Type II error when alpha is set equal to .01 
than when it is set equal to .05. The likelihood of committing a Type II error is also inversely 
related to the power of a statistical test. In other words, as the likelihood of committing a Type 
II error decreases, the power of the test increases. Consequently, the higher the alpha value (i.e.,
the higher the likelihood of committing a Type I error), the more powerful the test. 
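
The relationship between alpha, beta, and power can be illustrated with a short sketch. The Python code below is an illustration only. It assumes a one-tailed single-sample z test of H₀: μ = 100 against H₁: μ > 100 with a known population standard deviation, and it holds the sample size and the true population mean fixed. The values employed (a true mean of 105, σ = 15, n = 36) and the function name power_one_tailed_z are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def power_one_tailed_z(mu0, mu_true, sigma, n, alpha):
    """Power of a one-tailed z test of H0: mu = mu0 against H1: mu > mu0.

    mu_true is the (hypothetical) actual population mean; sigma is assumed known.
    Returns (power, beta).
    """
    z_crit = std_normal.inv_cdf(1.0 - alpha)
    # Distance between the true mean and mu0 in standard error units.
    shift = (mu_true - mu0) / (sigma / sqrt(n))
    beta = std_normal.cdf(z_crit - shift)  # P(retain H0 when H0 is false)
    return 1.0 - beta, beta


for alpha in (0.05, 0.01):
    power, beta = power_one_tailed_z(mu0=100, mu_true=105, sigma=15, n=36, alpha=alpha)
    print(f"alpha = {alpha}: beta = {beta:.3f}, power = {power:.3f}")
```

Under these assumptions, lowering alpha from .05 to .01 raises beta from approximately .36 to approximately .63, and the power of the test falls from approximately .64 to approximately .37.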

Although the hypothesis testing model as described here is based on conducting a single 
study in order to evaluate a research hypothesis, throughout the book the author emphasizes the 
importance of replication in research. This recommendation is based on the fact that inferential 
statistical tests make certain assumptions, many of which a researcher can never be sure have 
been met. Since the accuracy of the probability values in tables of critical values for test statistics 
is contingent upon the validity of the assumptions underlying the test, if any of the assumptions
have been violated, the accuracy of the tables can be compromised. In view of this, the most 
effective way of determining the truth with regard to a particular question, especially if practical 
decisions are to be made on the basis of the results of research, is to conduct multiple studies 
which evaluate the same hypothesis. When multiple studies yield consistent results, the conclusion
that the correct decision has been made with respect to the hypothesis under study is less likely
to be challenged. A general discussion of statistical methods that can be employed to aid in the
interpretation of the results of multiple studies that evaluate the same general hypothesis can be
found under the discussion of meta-analysis and related topics in Section IX (the Addendum) 
of the Pearson product-moment correlation coefficient. 


Estimation in Inferential Statistics 


In addition to hypothesis testing, inferential statistics can also be employed for estimating the 
value of one or more population parameters. Within this framework there are two types of esti- 
mation. Point estimation (which is the less commonly employed of the two methods) involves 
estimating the value of a parameter from the computed value of a statistic. The more commonly 
employed method of estimation is interval estimation, which involves computing a range of 
values within which a researcher can state with a high degree of confidence the true value of the 
parameter falls. Such a range of values is referred to as a confidence interval. As an example, 
a 95% confidence interval for a population mean stipulates the range of values within which a 
researcher can be 95% confident that the true value of the population mean falls. Stated in prob- 
abilistic terms, there is a probability/likelihood of .95 that the true value of the population mean 
falls within the range of values that define the 95% confidence interval. 
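
A minimal sketch of interval estimation is presented below. The Python code is an illustration only. It assumes that the population standard deviation is known, so that the confidence interval for the population mean can be based on the normal distribution. The values employed (a sample mean of 103.6, σ = 15, n = 49) and the function name z_confidence_interval are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) limits of a confidence interval for a population mean.

    Assumes the population standard deviation (sigma) is known.
    """
    # Two-tailed critical z value, e.g. approximately 1.96 for 95% confidence.
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z_crit * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


lower, upper = z_confidence_interval(sample_mean=103.6, sigma=15, n=49)
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```

For these hypothetical values the 95% confidence interval extends from approximately 99.40 to approximately 107.80. The point estimate of the population mean is the sample mean of 103.6, which lies at the center of the interval.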

Another measure that is often estimated within the framework of research is the magnitude 
of treatment effect (also referred to as effect size) present in a study. Effect size is a value that 
indicates the proportion or percentage of variability on a dependent variable that can be attrib- 
uted to variation on the independent variable (the terms dependent variable and independent 
variable are defined later in this section). Throughout this book the concept of effect size is 
discussed, and numerous measures of effect size are presented. At the present time there is con- 
siderable debate among researchers with regard to the role that measures of effect size should be 
accorded in summarizing the results of research. In point of fact, during the past 20 years an 
increasing number of individuals have become highly critical of the traditional hypothesis testing
model and the associated concept of statistical significance. These individuals have argued that
a measure of the effect size computed for a study is more meaningful than whether or not an 
inferential statistical test has yielded a statistically significant result. The controversy 
surrounding effect size versus statistical significance is discussed in detail under the discussion 
of meta-analysis and related topics in Section IX (the Addendum) of the Pearson product- 
moment correlation coefficient. In addition, it is also addressed within the framework of the 
discussion of measures of effect size throughout the book (e.g., in Section VI of both the t test
for two independent samples and the single-factor between-subjects analysis of variance 
(Test 21)). 


Basic Concepts and Terminology Employed in Experimental 
Design 


Inferential statistical tests can be employed to evaluate data that are generated from a broad range 
of experimental designs. This section will review some basic terminology in the area of experi- 
mental design that will be employed throughout this book. 

Typically, experiments involve two or more experimental conditions. These conditions 
are often referred to as treatments and, in experiments where different subjects serve in each of 
the conditions, the term groups is commonly employed to differentiate the conditions from one 
another. At this point it is instructive to review the distinction between a between-subjects de- 
sign and a within-subjects design. A between-subjects design (also known as an independent- 
groups design) is one in which different subjects serve in each of the experimental conditions. 
In a within-subjects design (also referred to as a repeated-measures design, dependent 
samples design, and correlated samples design), each subject serves in all of the experimental 
conditions. A design involving matched subjects is also treated as a within-subjects design. 
In a matched-subjects design (which is discussed in detail under the t test for two dependent
samples (Test 17)), each subject is paired with one or more other subjects who are similar with 
respect to one or more characteristics that are highly correlated with the dependent variable 
(which will be discussed shortly). A matched-subjects design and a within-subjects design 
are sometimes categorized as a randomized-blocks design. The latter term refers to a design 
that employs homogeneous blocks of subjects (which matched subjects represent). When a 
within-subjects design is conceptualized as a randomized-blocks design, it is because within 
each block the same subject is matched with himself by virtue of serving under all of the 
experimental conditions. 

A basic distinction in experimental design is that made between an independent and a 
dependent variable. In any experiment involving two or more experimental conditions, the 
independent variable is the experimental manipulation or preexisting subject characteristic that 
distinguishes the different experimental conditions from one another. Thus, in the antidepressant 
drug study discussed earlier, the independent variable is whether a subject receives the drug
or a placebo. Since the number of levels of an independent variable corresponds to the number 
of experimental conditions or treatments, the independent variable in the drug study is comprised 
of two levels. 

A dependent variable is the specific measure which is hypothesized to be influenced by 
or associated with the independent variable. Thus in the drug study, the depression scores of sub- 
jects represent the dependent variable, since it is hypothesized that the depression score of a 
subject will be a function of which treatment the subject receives. When a null hypothesis is 
rejected in an experiment involving two or more treatments, the researcher is concluding that the 
subjects’ scores on the dependent variable are dependent upon the level of the independent
variable to which they were assigned.

It is possible to have more than one independent variable in an experiment. Experimental
designs that involve more than one independent variable are referred to as factorial designs. In
such experiments, the number of independent variables will correspond to the number of factors 
in the experiment, and each independent variable/factor will be comprised of two or more levels. 
It is also possible to have more than one dependent variable in an experiment. Typically,
experiments involving two or more dependent variables are evaluated with multivariate 
statistical procedures, a topic that will not be covered in detail in this book. 

In describing a between-subjects design, a distinction is commonly made between a true 
experiment as opposed to a natural experiment (which is also referred to as an ex post facto 
study). This distinction is predicated on the fact that in a true experiment the following applies: 
a) Subjects are randomly assigned to a group; and b) The independent variable is manipulated 
by the experimenter. The antidepressant drug study illustrates an example of an experiment in 
which the independent variable is manipulated, since in that study the experimenter determined 
who received the drug and who received the placebo. In a natural experiment random assign- 
ment of subjects to groups is impossible, since the independent variable is not manipulated by 
the experimenter, but instead is some preexisting subject characteristic (such as gender, race, 
etc.). Thus, if we compare the overall health of smokers and nonsmokers, the independent 
variable in such a study is whether or not a person smokes, which is something that is determined 
by "nature" prior to the experiment. 

The advantage of a true experiment over a natural experiment is that the true 
experiment allows a researcher to exercise much greater control over the experimental situation. 
Since the experimenter randomly assigns subjects to groups in the true experiment, it is 
assumed that the groups formed are equivalent to one another, and as a result of this any 
differences between the groups with respect to the dependent variable can be directly attributed 
to the manipulated independent variable. The end result of all this is that the true experiment 
allows a researcher to draw conclusions with regard to cause and effect. 

The natural experiment, on the other hand, does not allow one to draw conclusions with 
regard to cause and effect. Essentially the type of information that results from a natural experi- 
ment is correlational in nature. Such experiments can only tell a researcher that a statistical 
association exists between the independent and dependent variables. The reason why natural 
experiments do not allow a researcher to draw conclusions with regard to cause and effect is that 
such experiments do not control the potential effects of confounding variables (also known as 
extraneous variables). A confounding variable is any variable that systematically varies with 
the different levels of the independent variable. To illustrate, assume that in a study comparing 
the overall health of smokers and nonsmokers, unbeknownst to the researcher all of the smokers 
in the study are people who have high stress jobs and all the nonsmokers are people with low 
stress jobs. If the outcome of such a study indicates that smokers are in poorer health than non- 
smokers, the researcher will have no way of knowing whether the inferior health of the smokers 
is due to smoking and/or job stress, or even to some other confounding variable of which he is 
unaware. 


Correlational Research 


As noted above, natural experiments only provide correlational information. In point of fact, 
there are a large number of correlational measures that have been developed which are appro- 
priate for use with different kinds of data. Measures of correlation are not inferential statistical 
tests, but are instead, descriptive measures which indicate the degree to which two or more 
variables are related to one another. Upon computing a measure of correlation, it is common 
practice to employ one or more inferential statistical tests to evaluate one or more hypotheses 
concerning the degree to which the variables are related to one another. Although correlations 
can be computed for more than two variables, the primary focus in this book will be on bivariate
correlational procedures, which are procedures that measure the degree of association between 
two variables. 

In the typical correlational study, scores on two measures/variables are available for n 
subjects. A major goal of correlational research is to determine the degree to which a subject's 
score on one variable can be predicted, if one knows the score of the subject on the second vari- 
able. As a general rule (although there are exceptions), the value computed for a measure of 
correlation/association (often referred to as a correlation coefficient) will usually fall within a 
range of values between 0 and an absolute value of 1. Whereas a value of 0 indicates that no 
statistical relationship exists between the variables, an absolute value of 1 indicates the presence 
of a maximal relationship. Consequently, the closer the absolute value of a correlation is to 1, 
the stronger the relationship between the two variables. To state it another way, the closer the 
absolute value of a correlation is to 1, the more accurately a subject's score on one variable can 
be predicted from the subject's score on the second variable. 

Many measures of correlation can assume both positive and negative values, and, typically, 
in such cases, the range of values the coefficient of correlation can assume is between -1 and +1.
Whereas the absolute value of a correlation coefficient indicates the strength of the relationship 
between the two variables, the sign of the correlation coefficient indicates the nature of the 
relationship. A positive correlation indicates the presence of what is referred to as a direct 
relationship. In a direct relationship a change in one variable is associated with a change in 
the other variable in the same direction. On the other hand, a negative correlation indicates the 
presence of an indirect or inverse relationship between the variables. In an indirect relation- 
ship a change in one variable is associated with a change in the other variable in the opposite 
direction. A more comprehensive discussion of the subject of correlation can be found in 
Section I of the Pearson product-moment correlation coefficient. 
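
The points above about strength and sign can be illustrated with a minimal sketch. The function name `pearson_r` and the three small data sets are hypothetical, chosen only for illustration; the computation is the standard product-moment formula, which the book treats in detail under the Pearson product-moment correlation coefficient.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / math.sqrt(ss_x * ss_y)

scores = [1, 2, 3, 4, 5]        # hypothetical scores on the first variable
direct = [2, 4, 6, 8, 10]       # increases as the first variable increases
inverse = [10, 8, 6, 4, 2]      # decreases as the first variable increases

r_direct = pearson_r(scores, direct)     # maximal direct relationship
r_inverse = pearson_r(scores, inverse)   # maximal inverse relationship
```

Both coefficients have an absolute value of 1, so each score on one variable is perfectly predictable from the other; only the sign differs, reflecting the direction of the relationship.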


Parametric versus Nonparametric Inferential Statistical Tests 


The inferential statistical procedures discussed in this book have been categorized as being 
parametric versus nonparametric tests. Some sources distinguish between parametric and 
nonparametric tests on the basis that parametric tests make specific assumptions with regard 
to one or more of the population parameters that characterize the underlying distribution(s) for 
which the test is employed. These same sources describe nonparametric tests as making no 
such assumptions about population parameters. In truth, nonparametric tests are really not 
assumption free, and in view of this some sources (e.g., Marascuilo and McSweeney (1977)) 
suggest that it might be more appropriate to employ the term "assumption freer" rather than 
nonparametric in relation to such tests. 

The distinction employed in this book for categorizing a procedure as a parametric versus 
a nonparametric test is primarily based on the level of measurement represented by the data that 
are being analyzed. As a general rule, inferential statistical tests that evaluate categorical/ 
nominal data and ordinal/rank-order data are categorized as nonparametric tests, while those 
tests that evaluate interval data or ratio data are categorized as parametric tests. Although the 
appropriateness of employing level of measurement as a criterion in this context has been debated, 
its usage provides a reasonably simple and straightforward schema for categorization that 
facilitates the decision-making process for selecting an appropriate statistical test. 

There is general agreement among most researchers that as long as there is no reason to 
believe that one or more of the assumptions of a parametric test have been violated, when the 
level of measurement for a set of data is interval or ratio, the data should be evaluated with the 
appropriate parametric test. However, if one or more of the assumptions of a parametric test are 
violated, some (but not all) sources believe it is prudent to transform the data into a format that
makes it compatible for analysis with the appropriate nonparametric test. Related to this is that 
even though parametric tests generally provide a more powerful test of an alternative hypothesis 
than their nonparametric analogs, the power advantage of a parametric test may be negated if one 
or more of its assumptions are violated. 

The reluctance among some sources to transform interval/ratio data into an ordinal/rank-order
or categorical/nominal format for the purpose of analyzing it with a nonparametric test is
based on the fact that interval/ratio data contain more information than either of the latter two 
forms of data. Because of their reluctance to sacrifice information, these sources take the 
position that even when there is reason to believe that one or more of the assumptions of a 
parametric test has been violated, it is still more prudent to employ the appropriate parametric 
test. Generally, when a parametric test is employed under such conditions, certain adjustments 
are made in evaluating the test statistic in order to improve its reliability. 

In the final analysis, the debate concerning whether one should employ a parametric or 
nonparametric test for a specific experimental design turns out to be of little consequence in most 
instances. The reason for this is that most of the time, when a parametric test and its nonparametric
analog are employed to evaluate the same set of data, they lead to identical or similar 
conclusions. This latter observation is demonstrated throughout this book with numerous 
examples. In those instances where the two types of test yield conflicting results, the truth can 
best be determined by conducting multiple experiments which evaluate the hypothesis under 
study. A detailed discussion of statistical methods that can be employed for pooling the results 
of multiple studies that evaluate the same general hypothesis can be found under the discussion 
of meta-analysis and related topics in Section IX (the Addendum) of the Pearson product- 
moment correlation coefficient. 


Selection of the Appropriate Statistical Procedure 


The Handbook of Parametric and Nonparametric Statistical Procedures is intended to be 
a comprehensive resource on inferential statistical tests and measures of correlation/association. 
The section to follow presents an outline of the statistical procedures covered in the book. 
Following the outline the reader is provided with guidelines and accompanying decision tables 
to facilitate the selection of the appropriate statistical procedure for a specific experimental 
design. 
Endnotes


1. Strictly speaking $\tilde{s}$ is not an unbiased estimate of $\sigma$, although it is usually employed as
such. In point of fact, $\tilde{s}$ slightly underestimates $\sigma$, especially when the value of n is small.
Zar (1999) notes that although corrections for bias in estimating $\sigma$ have been developed by
Gurland and Tripathi (1971) and Tolman (1971), they are rarely employed, since they gen- 
erally have no practical impact on the outcome of an analysis. 
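
A small simulation can make the direction of this bias visible. The sketch below assumes that the estimate in question is the usual sample standard deviation computed with an n - 1 denominator (which is what Python's `statistics.stdev` returns). It draws repeated samples of size 3 from a normal population whose standard deviation is set to 1. The seed, sample size, and number of trials are arbitrary illustrative choices.

```python
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable

SIGMA = 1.0      # population standard deviation assumed for the demonstration
N = 3            # deliberately small sample size
TRIALS = 20000   # number of samples drawn

# statistics.stdev divides the sum of squares by n - 1 before taking the root.
estimates = [
    statistics.stdev([random.gauss(0.0, SIGMA) for _ in range(N)])
    for _ in range(TRIALS)
]
mean_estimate = statistics.fmean(estimates)  # long-run average of the estimate
```

Under these assumptions the average of the estimates falls below the population value of 1, which is the underestimation the endnote describes.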


2. The inequality sign > means greater than. Some other inequality signs used throughout
the book are <, which means less than; ≥, which means greater than or equal to; and ≤,
which means less than or equal to.


3. The absolute value is the magnitude of a number irrespective of the sign. 


4. McElroy (1979) describes the use of the equation skewness = $(\bar{X} - M)/\tilde{s}$ as an alternative
approximate measure of skewness. McElroy (1979) and Zar (1999) describe the following
measure of skewness, referred to as the Bowley coefficient of skewness (Bowley (1920)),
which employs quartiles of the distribution (where $Q_i$ represents the $i$th quartile):
Skewness = $(Q_3 + Q_1 - 2Q_2)/(Q_3 - Q_1)$. The latter index yields values in the range -1
for a maximally negatively skewed distribution to +1 for a maximally positively skewed
distribution.
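
As a brief sketch of how the quartile-based index behaves, the fragment below takes the Bowley coefficient in its usual form, (Q3 + Q1 - 2Q2)/(Q3 - Q1), and evaluates it for three hypothetical sets of quartiles. The function name `bowley_skewness` and the quartile values are illustrative assumptions, not taken from the book.

```python
def bowley_skewness(q1, q2, q3):
    """Bowley coefficient of skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)

symmetric = bowley_skewness(10, 20, 30)  # median midway between the quartiles
positive = bowley_skewness(10, 10, 30)   # median coincides with the first quartile
negative = bowley_skewness(10, 30, 30)   # median coincides with the third quartile
```

The three cases yield 0, +1, and -1 respectively, which correspond to the midpoint and the two limits of the range stated in the endnote.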


5. The symbol π in Equations I.25 and I.26 represents the mathematical constant pi (which
equals 3.14159...). The numerical value of π represents the ratio of the circumference of
a circle to its diameter. The value e in Equations I.25 and I.26 equals 2.71828... . Like π,
e is a fundamental mathematical constant. Specifically, e is the base of the natural system
of logarithms, which will be clarified shortly. Both π and e represent what are referred to
as irrational numbers. An irrational number has a decimal notation that goes on forever 
without a repeating pattern of digits. In contrast, a rational number (derived from the 
word ratio) is either an integer or a fraction (which is the ratio between whole/integer 
numbers), which when expressed as a decimal always terminates at some point or assumes 
a repetitive pattern. Examples of rational numbers are 1/4 = .25 which has a terminating 
decimal, or 1/3 = .33333... , which is characterized by an endless repeating pattern of digits 
(Hoffman, 1998). 

A logarithm is the value of an exponent which indicates the power to which a number,
which is referred to as a base value, must be raised in order to yield a specific number
value. Typically, e and the number 10 are employed as base values for logarithms. Logarithms
that employ e as the base value are referred to as natural (or Naperian) logarithms,
while logarithms that employ 10 as the base value are referred to as common (or
Briggsian) logarithms. The notation $\log_e 20 = 2.9957$ or $\ln 20 = 2.9957$ indicates that
the value of e (which is the base value of the logarithm) must be raised to the 2.9957th
power in order to result in the value 20. Note that if the notation log is employed when e
is the base value, the subscript e must be indicated, whereas if the notation ln is employed,
it is assumed that e is the base value of the logarithm. If the notation log 20 is employed
without the e subscript, it is assumed that the base of the logarithm is 10. Thus, log 20 = 
1.3010, indicates that 10 must be raised to the 1.3010th power to equal 20. Logarithms are 
employed later in the book within the framework of the operations involved in a number 
of statistical procedures. 
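
The two logarithms of 20 quoted in this endnote can be checked with a short sketch using Python's standard `math` module, where `math.log` is the natural logarithm and `math.log10` is the common logarithm. The variable names are illustrative only.

```python
import math

natural = math.log(20)     # base e, i.e., the natural logarithm of 20
common = math.log10(20)    # base 10, i.e., the common logarithm of 20

# Raising each base to its logarithm recovers the original number.
back_from_e = math.e ** natural
back_from_10 = 10 ** common
```

Rounded to four decimal places, `natural` is 2.9957 and `common` is 1.3010, matching the values given above, and both `back_from_e` and `back_from_10` return 20 up to floating-point rounding.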


6. Previously, the term tail was defined as the lower or upper extremes of a distribu- 
tion. Although the latter definition is correct, I am taking some liberty here by employing 
the term tail in this context to refer more generally to the left or right half of the dis- 
tribution. 


7. Although the values in Column 4 of Table A1 will not be employed in our example, a brief 
explanation of what they represent follows. In the case of the standard normal distribution, 
when a value of X is substituted in Equations I.25 or I.26, the value of X will correspond 
to a z score. When a z value is employed to represent X, the value of Y computed with
Equation I.25/I.26 will correspond to the value recorded for the ordinate in Column 4 of
Table A1. The value of the ordinate represents the height of the normal curve for that z
value. To illustrate, if the value z = 0 is employed to represent X, Equation I.25/I.26 reduces
to $Y = 1/\sqrt{2\pi}$, which equals $Y = 1/\sqrt{(2)(3.1416)} = .3989$. The resulting value
.3989 is the value recorded in Column 4 of Table A1 for the ordinate that corresponds to 
the z score z = 0. 
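
The arithmetic in this endnote can be reproduced with a short sketch. Equations I.25 and I.26 are not restated here, so the function below assumes they are the standard normal density, whose height at a given z score is exp(-z²/2)/√(2π). The function name `normal_ordinate` is hypothetical.

```python
import math

def normal_ordinate(z):
    """Height of the standard normal curve at the z score z."""
    return math.exp(-(z ** 2) / 2) / math.sqrt(2 * math.pi)

height_at_zero = normal_ordinate(0)   # the exponential term equals 1 when z = 0
```

Rounded to four decimal places, `height_at_zero` is .3989, the ordinate reported for z = 0.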


8. A proportion is converted into a percentage by moving the decimal point two places to the 
right. 


9. In actuality, the values of the sample means do not have to be identical to support the null 
hypothesis. Due to sampling error, which is a discrepancy between the value of a statistic 
and the parameter it estimates, even when two samples come from the same population, the 
value of the two sample means will usually not be identical. The larger the sample size em- 
ployed in a study, the less the influence of sampling error and, consequently, the closer one 
can expect two sample means to be to one another if, in fact, they do represent the same 
population. With small sample sizes, however, a large difference between sample means 
is not unusual even when the samples come from the same population and, because of this, 
a large difference may not be grounds for rejecting the null hypothesis. 
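
The effect of sample size on sampling error can be sketched with a small simulation. Everything specific in it is an illustrative assumption: a normal population with a mean of 100 and a standard deviation of 15, sample sizes of 5 and 500, 1000 repetitions, and a fixed random seed. The function name `mean_gap` is hypothetical.

```python
import random
import statistics

random.seed(1)   # fixed seed so the run is repeatable

def mean_gap(n, trials=1000):
    """Average absolute difference between the means of two samples of
    size n drawn from the same normal population (mean 100, SD 15)."""
    gaps = []
    for _ in range(trials):
        a = [random.gauss(100, 15) for _ in range(n)]
        b = [random.gauss(100, 15) for _ in range(n)]
        gaps.append(abs(statistics.fmean(a) - statistics.fmean(b)))
    return statistics.fmean(gaps)

gap_small = mean_gap(5)     # two small samples from one population
gap_large = mean_gap(500)   # two large samples from the same population
```

With this setup the typical gap between two sample means is several times larger for n = 5 than for n = 500, even though every sample comes from the same population.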


10. Some sources employ the notation p ≤ .05, indicating a probability of equal to or less than
.05. The latter notation will not be used unless the computed value of a test statistic is the 
exact value of the tabled critical value. 


11. Inspection of Column 3 in Table A1 reveals that the proportion for z = 1.64 is .0505. This
latter value is the same distance from the proportion .05 as the value .0495 derived for z =
1.65. If Table A1 documented proportions to five decimal places, it would turn out that z
= 1.65 yields a value that is slightly closer to .05 than does z = 1.64. Some books, however,
do employ z = 1.64 as the tabled critical one-tailed .05 z value. 


12. Since interval and ratio data are viewed the same within the decision making process with
respect to test selection, the expression interval/ratio will be used throughout to indicate that
either type of data is appropriate for use with a specific test. 
Outline of Inferential Statistical Tests and
Measures of Correlation/Association 


I. Inferential statistical tests employed with a single sample 
A. Inferential statistical tests employed with interval/ratio data 
1. Inferential statistical tests employed with interval/ratio data for evaluating a 
hypothesis about the mean of a single population 
Test 1: The Single-Sample z Test 
Test 2: The Single-Sample t Test
2. Inferential statistical tests employed with interval/ratio data for evaluating a 
hypothesis about a population parameter/characteristic other than a mean 
Test 3: The Single-Sample Chi-Square Test for a Population Variance 
Test 4: The Single-Sample Test for Evaluating Population Skewness 
Test 5: The Single-Sample Test for Evaluating Population Kurtosis 
Test 5a: The D'Agostino-Pearson Test of Normality
Test 10f: The Mean Square Successive Difference Test (for serial 
randomness) 
Test 11e: Procedures for Identifying Outliers 
B. Inferential statistical tests employed with ordinal/rank-order data 
1. Inferential statistical tests employed with ordinal/rank-order data for evaluating 
a hypothesis about the median of a single population, or the distribution of data 
in a single population 
Test 6: The Wilcoxon Signed-Ranks Test 
Test 7: The Kolmogorov-Smirnov Goodness-of-fit Test for a Single 
Sample 
Test 7a: The Lilliefors Test for Normality 
Test 9b: The Single-Sample Test for the Median 
C. Inferential statistical tests employed with categorical/nominal data 
1. Inferential statistical tests employed with categorical/nominal data for evaluating 
a hypothesis about the distribution of data in a single population 
Test 8: The Chi-Square Goodness-of-Fit Test 
Test 9: The Binomial Sign Test for a Single Sample 
Test 9a: The z Test for a Population Proportion
Test 10: The Single-Sample Runs Test 
Test 10a: The Runs Test for Serial Randomness 
Test 10b: The Frequency Test (for Randomness) 
Test 10c: The Gap Test (for Randomness) 
Test 10d: The Poker Test (for Randomness) 
Test 10e: The Maximum Test (for Randomness) 
Test 16b: The Chi-Square Test of Independence 


II. Inferential statistical tests employed with two independent samples 
A. Inferential statistical tests employed with interval/ratio data 
1. Inferential statistical tests employed with interval/ratio data for evaluating a 
hypothesis about the means of two independent populations 
Test 11: The t Test for Two Independent Samples
Test 11d: The z Test for Two Independent Samples
Test 21: The Single-Factor Between-Subjects Analysis of Variance 
Test 21j: The Single-Factor Between-Subjects Analysis of Covariance 
2. Inferential statistical tests employed with interval/ratio data for evaluating a 
hypothesis about variability in two independent populations 
Test 11a: Hartley's Fmax Test for Homogeneity of Variance/F Test for
Two Population Variances 
B. Inferential statistical tests employed with ordinal/rank-order data 
1. Inferential statistical tests employed with ordinal/rank-order data for evaluating 
a hypothesis about the medians, or some other characteristic (other than 
variability) of two independent populations 
Test 12: The Mann-Whitney U Test 
Test 12a: The Randomization Test for Two Independent Samples 
Test 12b: The Bootstrap (can be employed for variability) 
Test 12c: The Jackknife (can be employed for variability) 
Test 13: The Kolmogorov-Smirnov Test for Two Independent Samples 
Test 16e: The Median Test for Independent Samples 
Test 23: The van der Waerden Normal-Scores Test for k Independent 
Samples 
2. Inferential statistical tests employed with ordinal/rank-order data for evaluating 
a hypothesis about variability of two independent populations 
Test 14: The Siegel-Tukey Test for Equal Variability 
Test 15: The Moses Test for Equal Variability 
C. Inferential statistical tests employed with categorical/nominal data 
1. Inferential statistical tests employed with categorical/nominal data for evaluating 
a hypothesis about the distribution of data in two independent populations 
Test 16a: The Chi-Square Test for Homogeneity 
Test 16c: The Fisher Exact Test 
Test 16d: The z Test for Two Independent Proportions


III. Inferential statistical tests employed with two dependent samples 
A. Inferential statistical tests employed with interval/ratio data 
1. Inferential statistical tests employed with interval/ratio data for evaluating a 
hypothesis about the means of two dependent populations 
Test 17: The t Test for Two Dependent Samples
Test 17d: Sandler's A Test
Test 17e: The z Test for Two Dependent Samples
Test 24: The Single-Factor Within-Subjects Analysis of Variance 
2. Inferential statistical tests employed with interval/ratio data for evaluating a 
hypothesis about variability in two dependent populations 
Test 17a: The t Test for Homogeneity of Variance for Two Dependent
Samples 
B. Inferential statistical tests employed with ordinal/rank-order data 
1. Inferential statistical tests employed with ordinal/rank-order data for evaluating 
a hypothesis about the ordering of data in two dependent populations 
Test 18: The Wilcoxon Matched-Pairs Signed-Ranks Test 
Test 19: The Binomial Sign Test for Two Dependent Samples 
C. Inferential statistical tests employed with categorical/nominal data 
1. Inferential statistical tests employed with categorical/nominal data for evaluating
a hypothesis about the distribution of data in two dependent populations 
Test 20: The McNemar Test 
Test 20a: The Bowker Test of Symmetry 


IV. Inferential statistical tests employed with two or more independent samples 
A. Inferential statistical tests employed with interval/ratio data 
1. Inferential statistical tests employed with interval/ratio data for evaluating a 
hypothesis about the means of two or more independent populations which 
involve one independent variable/factor 
Test 21: The Single-Factor Between-Subjects Analysis of Variance 
Test 21a: Multiple t Tests/Fisher's LSD Test
Test 21b: The Bonferroni-Dunn Test
Test 21c: Tukey's HSD Test 
Test 21d: The Newman-Keuls Test 
Test 21e: The Scheffé Test 
Test 21f: The Dunnett Test 
Test 21j: The Single-Factor Between-Subjects Analysis of Covariance 
2. Inferential statistical tests employed with interval/ratio data for evaluating a 
hypothesis about variability in two or more independent populations 
Test 11a: Hartley's Fmax Test for Homogeneity of Variance/F Test for Two
Population Variances 
B. Inferential statistical tests employed with ordinal/rank-order data 
1. Inferential statistical tests employed with ordinal/rank-order data for evaluating a
hypothesis about the medians, or some other characteristic of two or more 
independent populations 
Test 16e: The Median Test for Independent Samples 
Test 22: The Kruskal-Wallis One-Way Analysis of Variance by Ranks 
Test 23: The van der Waerden Normal-Scores Test for k Independent 
Samples 
C. Inferential statistical tests employed with categorical/nominal data 
1. Inferential statistical tests employed with categorical/nominal data for evaluating 
a hypothesis about the distribution of data in two or more independent populations 
Test 16a: The Chi-Square Test for Homogeneity 


V. Inferential statistical tests employed with two or more dependent samples 
A. Inferential statistical tests employed with interval/ratio data 
1. Inferential statistical tests employed with interval/ratio data for evaluating a 
hypothesis about the means of two or more dependent populations which involve 
one independent variable/factor. 
Test 24: The Single-Factor Within-Subjects Analysis of Variance 
Test 24a: Multiple t Tests/Fisher's LSD Test
Test 24b: The Bonferroni-Dunn Test 
Test 24c: Tukey's HSD Test 
Test 24d: The Newman-Keuls Test 
Test 24e: The Scheffé Test 
Test 24f: The Dunnett Test 
B. Inferential statistical tests employed with ordinal/rank-order data 
1. Inferential statistical tests employed with ordinal/rank-order data for evaluating 
a hypothesis about the medians of two or more dependent populations
Test 25: The Friedman Two-Way Analysis of Variance by Ranks 
C. Inferential statistical tests employed with categorical/nominal data 
1. Inferential statistical tests employed with categorical/nominal data for evaluating 
a hypothesis about the distribution of data in two or more dependent populations 
Test 26: The Cochran Q Test 


VI. Inferential statistical tests employed with factorial designs 
A. Inferential statistical tests employed with interval/ratio data 
1. Inferential statistical tests employed with interval/ratio data for evaluating a 
hypothesis about the means of two or more populations in a design involving two 
independent variables/factors 
Test 27: The Between-Subjects Factorial Analysis of Variance 
Test 27a: Multiple t Tests/Fisher's LSD Test
Test 27b: The Bonferroni-Dunn Test 
Test 27c: Tukey's HSD Test 
Test 27d: The Newman-Keuls Test 
Test 27e: The Scheffé Test 
Test 27f: The Dunnett Test 
Test 27i: The Factorial Analysis of Variance for a Mixed Design 
Test 27j: The Within-Subjects Factorial Analysis of Variance 


VII. Measures of correlation/association 
A. Measures of correlation/association employed with interval/ratio data 
1. Bivariate measures 

Test 28: The Pearson Product-Moment Correlation Coefficient (and tests 
for evaluating various hypotheses concerning the value of one or more 
product-moment correlation coefficients or regression coefficients (Tests 
28a-28g))

2. Multivariate measures 

Test 28k: The Multiple Correlation Coefficient (and the test for evaluating 
the significance of a multiple correlation coefficient (Test 28k-a)) 

Test 28l: The Partial Correlation Coefficient (and the test for evaluating
the significance of a partial correlation coefficient (Test 28l-a))

Test 28m: The Semipartial Correlation Coefficient (and the test for 
evaluating the significance of a semi-partial correlation coefficient (Test 
28m-a)) 

B. Measures of correlation/association employed with ordinal/rank order data 
1. Bivariate measures/Two sets of ranks 

Test 29: Spearman's Rank-Order Correlation Coefficient (and the test for 
evaluating the significance of Spearman's rank-order correlation 
coefficient (Test 29a))

Test 30: Kendall’s Tau (and the test for evaluating the significance of 
Kendall's tau (Test 30a)) 

Test 32: Goodman and Kruskal’s Gamma (and the test for evaluating the 
significance of gamma (Test 32a)) 

2. Ordinal measure of association for three or more samples/sets of ranks 

Test 31: Kendall's Coefficient of Concordance (and the test for evaluating
the significance of the coefficient of concordance (Test 31a))
C. Measures of correlation/association employed with categorical/nominal data

Test 16f: The Contingency Coefficient 

Test 16g: The Phi Coefficient 

Test 16h: Cramér's Phi Coefficient 

Test 16i: Yule’s Q 

Test 16j: The Odds Ratio (and test of significance for an odds ratio (Test
16j-a))
D. Other bivariate measures of correlation/association (and effect size) employed when 
interval/ratio data are used or implied for at least one variable 

Test 28h: The Point-Biserial Correlation Coefficient (and the test for 
evaluating the significance of a point-biserial correlation coefficient (Test 
28h-a)) 

Test 28i: The Biserial Correlation Coefficient (and the test for evaluating 
the significance of a biserial correlation coefficient (Test 28i-a)) 

Test 28j: The Tetrachoric Correlation Coefficient (and the test for evaluating
the significance of a tetrachoric correlation coefficient (Test 28j-a))

Tests 11c/17c/21g/24g/27g: Omega Squared 

Test 21h: Eta Squared 

Test 11b/17b: Cohen's d Index (and Test 2a for one variable) 

Test 21i/24h/27h: Cohen's f Index 


VIII. Additional procedures 
A. Meta-analytic procedures 

Test 28n: Procedure for comparing k studies with respect to significance 
level 

Test 28o: The Stouffer procedure for obtaining a combined significance
level for k studies 

Test 28p: Procedure for comparing k studies with respect to effect size 

Test 28q: Procedure for obtaining a combined effect size for k studies 
Guidelines and Decision Tables for Selecting
the Appropriate Statistical Procedure 


Tables I.8-I.11 are designed to facilitate the selection of the appropriate statistical test. 
Tables I.8-I.10 list the major inferential statistical procedures described in the book, based on 
the level of measurement the data being evaluated represent. Specifically, Table I.8 lists 
inferential statistical tests employed with interval/ratio data, Table I.9 lists inferential statistical 
tests employed with ordinal/rank-order data,* and Table I.10 lists inferential statistical tests 
employed with categorical/nominal data. Table I.11 lists the measures of correlation/association 
that are described in the book. Using the aforementioned tables, the following guidelines should 
be employed in selecting the appropriate statistical test. 


1. Determine if the analysis involves computing a correlation coefficient/measure of association 
and, if it does, go to Table I.11. The selection of the appropriate measure in Table I.11 is 
based on the level of measurement represented by each of the variables for which the measure 
of correlation/association is computed. 

2. If the analysis does not involve computing a measure of correlation/association, it will be 
assumed that the data will be evaluated through use of an inferential statistical test. To select 
the appropriate inferential statistical test, the following protocol should be employed. 

a) State the general hypothesis that is being evaluated. 

b) Determine if the study involves a single sample or more than one sample. 

c) If the study involves a single sample, the appropriate test will be one of the tests for a 
single sample in Tables I.8, I.9, or I.10. In order to determine which table to employ, de- 
termine the level of measurement represented by the data that are being evaluated. If the 
level of measurement is interval/ratio, Table I.8 is employed. If the level of measurement 
is ordinal/rank-order, Table I.9 is appropriate. If the level of measurement is categorical/ 
nominal, Table I.10 is utilized. 

d) If there is more than one sample, determine how many samples/treatments there are and 
whether they are independent or dependent. Determine the level of measurement repre- 
sented by the data that are being evaluated (which represents the dependent variable in 
the study). 

1) If the level of measurement is interval/ratio, go to Table I.8. Identify the test or tests
that are appropriate for that level of measurement with respect to the number and type
of samples employed in the study.

2) If the level of measurement is ordinal/rank-order, go to Table I.9. Identify the test or
tests that are appropriate for that level of measurement with respect to the number and
type of samples employed in the study.

3) If the level of measurement is categorical/nominal, go to Table I.10. Identify the test
or tests that are appropriate for that level of measurement with respect to the number 
and type of samples employed in the study. 


* In the case of the following three tests listed in Table I.9, the dependent variable will be interval/ratio data
which is converted into a format in which the resulting scores are rank-ordered: The Wilcoxon signed- 
ranks test (Test 6), the Moses test for equal variability (Test 15), and the Wilcoxon matched-pairs 
signed-ranks test (Test 18). 
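
The first stage of the selection guidelines above, choosing which decision table to consult, can be sketched as a simple lookup. The function name `select_table` and its parameters are hypothetical; the mapping itself is the one stated in the guidelines. Choosing a specific test within a table still depends on the number and type of samples and on the hypothesis being evaluated.

```python
def select_table(computes_correlation, level_of_measurement=None):
    """Return the decision table indicated by the selection guidelines.

    level_of_measurement is one of 'interval/ratio', 'ordinal/rank-order',
    or 'categorical/nominal'. It is not consulted when the analysis involves
    a measure of correlation/association, since Table I.11 covers all levels.
    """
    if computes_correlation:
        return "Table I.11"
    tables = {
        "interval/ratio": "Table I.8",
        "ordinal/rank-order": "Table I.9",
        "categorical/nominal": "Table I.10",
    }
    if level_of_measurement not in tables:
        raise ValueError(f"unknown level of measurement: {level_of_measurement!r}")
    return tables[level_of_measurement]
```

For instance, `select_table(False, "ordinal/rank-order")` returns "Table I.9", while `select_table(True)` returns "Table I.11" regardless of the level of measurement.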
Table I.8 Decision Table for Inferential Statistical Tests Employed with Interval/Ratio Data

Number of samples / Hypothesis evaluated / Test

One independent variable

  Single sample
    Hypothesis about a population mean
      The single-sample z test (Test 1) (σ known)
      The single-sample t test (Test 2) (σ unknown)
    Hypothesis about a population parameter/characteristic other than the mean
      The single-sample chi-square test for a population variance (Test 3)
      The single-sample test for evaluating population skewness (Test 4)
      The single-sample test for evaluating population kurtosis (Test 5)
      The mean square successive difference test (for serial randomness) (Test 10f)
      The D'Agostino-Pearson test of normality (Test 5a)
      Procedures for identifying outliers (Test 11e)

  Two samples
    Two independent samples
      Hypothesis about difference between two independent population means
        The t test for two independent samples (Test 11)
        The z test for two independent samples (Test 11d)
        The single-factor between-subjects analysis of variance (Test 21)
        The single-factor between-subjects analysis of covariance (Test 21j)
      Hypothesis about two independent population variances
        Hartley's Fmax test for homogeneity of variance/F test for two population variances (Test 11a)
    Two dependent samples
      Hypothesis about difference between two dependent population means
        The t test for two dependent samples (Test 17)
        Sandler's A test (Test 17d)
        The z test for two dependent samples (Test 17e)
        The single-factor within-subjects analysis of variance (Test 24)
      Hypothesis about two dependent population variances
        The t test for homogeneity of variance for two dependent samples (Test 17a)

  Two or more samples
    Two or more independent samples
      Hypothesis about difference between two or more independent population means
        The single-factor between-subjects analysis of variance (Test 21)
        The single-factor between-subjects analysis of covariance (Test 21j)
      Hypothesis about two or more independent population variances
        Hartley's Fmax test for homogeneity of variance/F test for two population variances (Test 11a)
    Two or more dependent samples
      Hypothesis about difference between two or more dependent population means
        The single-factor within-subjects analysis of variance (Test 24)
      Hypothesis about two or more dependent population variances
        See discussion of sphericity assumption under the single-factor within-subjects analysis of variance (Test 24)

Two independent variables

  Hypothesis about difference between two or more population means
    The between-subjects factorial analysis of variance (Test 27)
    The factorial analysis of variance for a mixed design (Test 27i)
    The within-subjects factorial analysis of variance (Test 27j)
Table I.9 Decision Table for Inferential Statistical Tests Employed
with Ordinal/Rank-Order Data

Single sample
    Hypothesis about a population median or the distribution of data in a single population
        The Wilcoxon signed-ranks test (Test 6)
        The Kolmogorov-Smirnov goodness-of-fit test for a single sample (Test 7)
        The Lilliefors test for normality (Test 7a)
        The single-sample test for the median (Test 9b)

Two samples
    Two independent samples
        Hypothesis about two independent population medians, or some other characteristic
        (other than variability) of two independent populations
            The Mann-Whitney U test (Test 12)
            The randomization test for two independent samples (Test 12a)
            The bootstrap (Test 12b) (can be employed for variability)
            The jackknife (Test 12b) (can be employed for variability)
            The Kolmogorov-Smirnov test for two independent samples (Test 13)
            The median test for independent samples (Test 16e)
            The van der Waerden normal-scores test for k independent samples (Test 23)
        Hypothesis about variability in two independent populations
            The Siegel-Tukey test for equal variability (Test 14)
            The Moses test for equal variability (Test 15)
    Two dependent samples
        Hypothesis about the ordering of data in two dependent populations
            The Wilcoxon matched-pairs signed-ranks test (Test 18)
            The binomial sign test for two dependent samples (Test 19)

Two or more samples
    Two or more independent samples
        Hypothesis about two or more independent population medians, or some other
        characteristic of two or more independent populations
            The Kruskal-Wallis one-way analysis of variance by ranks (Test 22)
            The van der Waerden normal-scores test for k independent samples (Test 23)
            The median test for independent samples (Test 16e)
    Two or more dependent samples
        Hypothesis about two or more dependent population medians
            The Friedman two-way analysis of variance by ranks (Test 25)
Table I.10 Decision Table for Inferential Statistical Tests Employed
with Categorical/Nominal Data

Single sample
    Hypothesis about distribution of data in a single population
        The chi-square goodness-of-fit test (Test 8)
        The binomial sign test for a single sample (Test 9)
        The z test for a population proportion (Test 9a)
        The single-sample runs test (Test 10)
        The runs test for serial randomness (Test 10a)
        The frequency test (for randomness) (Test 10b)
        The gap test (for randomness) (Test 10c)
        The poker test (for randomness) (Test 10d)
        The maximum test (for randomness) (Test 10e)
        The chi-square test of independence (Test 16b)

Two samples
    Two independent samples
        Hypothesis about distribution of data in two independent populations
            The chi-square test for homogeneity (Test 16a)
            The Fisher exact test (Test 16c)
            The z test for two independent proportions (Test 16d)
    Two dependent samples
        Hypothesis about distribution of data in two dependent populations
            The McNemar test (Test 20)
            The Bowker test of symmetry (Test 20a)

Two or more samples
    Two or more independent samples
        Hypothesis about distribution of data in two or more independent populations
            The chi-square test for homogeneity (Test 16a)
    Two or more dependent samples
        Hypothesis about distribution of data in two or more dependent populations
            The Cochran Q test (Test 26)
Table I.11 Decision Table for Measures of Correlation/Association

Interval/ratio data
    The Pearson product-moment correlation coefficient (Test 28)
    Multivariate
        The multiple correlation coefficient (Test 28k)
        The partial correlation coefficient (Test 28l)
        The semipartial correlation coefficient (Test 28m)

Ordinal/rank-order data
    Bivariate/two sets of ranks
        Spearman's rank-order correlation coefficient (Test 29)
        Kendall's tau (Test 30)
        Goodman and Kruskal's gamma (for ordered contingency tables) (Test 32)
    More than two samples/sets of ranks
        Kendall's coefficient of concordance (Test 31)

Categorical/nominal data
    Two dichotomous variables
        The contingency coefficient (Test 16f)
        The phi coefficient (Test 16g)
        Yule's Q (Test 16i)
        The odds ratio (Test 16j)
    Two nondichotomous variables
        The contingency coefficient (Test 16f)
        Cramér's phi coefficient (Test 16h)
        The odds ratio (Test 16j)

Other bivariate correlational measures for which interval/ratio data are employed or
implied for at least one of the variables
    Omega squared (One variable, interval/ratio data; second variable, two or more
    nominal levels) (Tests 11c/17c/21g/24g/27g)
    Eta squared (One variable, interval/ratio data; second variable, two or more
    nominal levels) (Test 21h)
    Cohen's d index (Test 11b/17b) (One variable, interval/ratio data; second
    variable, two nominal levels) (with Test 2a for one variable)
    Cohen's f index (One variable, interval/ratio data; second variable, two or
    more nominal levels) (Test 21i/24h/27h)
    The point-biserial correlation coefficient (One variable, interval/ratio data;
    second variable represented by dichotomous categories) (Test 28h)
    The biserial correlation coefficient (One variable, interval/ratio data; second
    variable, an interval/ratio variable expressed in the form of dichotomous
    categories) (Test 28i)
    The tetrachoric correlation coefficient (Two interval/ratio variables, both of
    which are expressed in the form of dichotomous categories) (Test 28j)
Inferential Statistical Tests
Employed with a Single Sample


Test 1:   The Single-Sample z Test
Test 2:   The Single-Sample t Test
Test 3:   The Single-Sample Chi-Square Test for a Population Variance
Test 4:   The Single-Sample Test for Evaluating Population Skewness
Test 5:   The Single-Sample Test for Evaluating Population Kurtosis
Test 6:   The Wilcoxon Signed-Ranks Test
Test 7:   The Kolmogorov-Smirnov Goodness-of-Fit Test for a Single Sample
Test 8:   The Chi-Square Goodness-of-Fit Test
Test 9:   The Binomial Sign Test for a Single Sample
Test 10:  The Single-Sample Runs Test (and Other Tests of Randomness)


© 2000 by Chapman & Hall/CRC


Test 1 


The Single-Sample z Test 
(Parametric Test Employed with Interval/Ratio Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test: Does a sample of n subjects (or objects) come from a
population in which the mean (μ) equals a specified value?


Relevant background information on test: The single-sample z test is employed in a
hypothesis testing situation involving a single sample in order to determine whether or not a sample
with a mean of X̄ is derived from a population with a mean of μ. If the result of the
single-sample z test yields a significant difference, the researcher can conclude there is a high
likelihood the sample is derived from a population with a mean value other than μ. The test
statistic for the single-sample z test is based on the normal distribution. A general discussion
of the normal distribution can be found in the Introduction.

The single-sample z test is used with interval/ratio data. The test should only be employed
if the value of the population standard deviation (σ) is known. In the event the value of σ is
unknown, the data should be evaluated with the single-sample t test (Test 2). The reader should
take note of the fact that some sources argue that even when one knows the value of σ, if the
sample size is very small, the single-sample t test provides a more accurate estimate of the
underlying sampling distribution for the data. Sources that take the latter position are not in
agreement with respect to the minimum sample size above which it is acceptable to employ the
single-sample z test (although it is usually n > 25).

The single-sample z test is based on the following assumptions: a) The sample has been 
randomly selected from the population it represents; and b) The distribution of data in the under- 
lying population the sample represents is normal. If either of the aforementioned assumptions 
is saliently violated, the reliability of the z test statistic may be compromised. 


II. Example 

Example 1.1. Thirty subjects take a test of visual-motor coordination for which the value of the
population mean is μ = 8, and the value of the population standard deviation is σ = 2. If the
average score of the sample of 30 subjects equals 7.4 (i.e., X̄ = 7.4), can one conclude that the
sample, in fact, came from a population in which the mean is μ = 8?


III. Null versus Alternative Hypotheses 


Null hypothesis          H₀: μ = 8

(The mean of the population the sample represents equals 8.)

Alternative hypothesis   H₁: μ ≠ 8

(The mean of the population the sample represents does not equal 8. This is a nondirectional
alternative hypothesis, and it is evaluated with a two-tailed test. In order to be supported, the
absolute value of z must be equal to or greater than the tabled critical two-tailed z value at the
prespecified level of significance. Thus, either a significant positive z value or a significant
negative z value will provide support for this alternative hypothesis.)


or 

H₁: μ > 8

(The mean of the population the sample represents is greater than 8. This is a directional
alternative hypothesis, and it is evaluated with a one-tailed test. It will only be supported if the sign
of z is positive, and the absolute value of z is equal to or greater than the tabled critical one-tailed
z value at the prespecified level of significance.)


or 

H₁: μ < 8

(The mean of the population the sample represents is less than 8. This is a directional
alternative hypothesis, and it is evaluated with a one-tailed test. It will only be supported if the sign
of z is negative, and the absolute value of z is equal to or greater than the tabled critical one-tailed
z value at the prespecified level of significance.)


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 


Assume that the following values represent the scores of the sample of n = 30 subjects who take 
the test of visual-motor coordination in Example 1.1: 9, 10, 6, 4, 8, 11, 10, 5, 5, 6, 13, 12, 4, 4, 
3, 9, 12, 5, 6, 6, 8, 9, 8, 5, 7, 9, 10, 9, 5, 4. 

Since X_i can be employed to represent the score of the i-th subject, by adding all thirty
scores we obtain: ΣX_i = ΣX = 222.

Equation 1.1 is used to compute the mean of the sample.

$$\bar{X} = \frac{\sum X}{n}$$  (Equation 1.1)

Employing Equation 1.1, we confirm that the mean of the sample is X̄ = 7.4, the value
stated in Example 1.1.

$$\bar{X} = \frac{222}{30} = 7.4$$
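The arithmetic above can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and do not come from the text.

```python
# Scores of the n = 30 subjects listed for Example 1.1.
scores = [9, 10, 6, 4, 8, 11, 10, 5, 5, 6, 13, 12, 4, 4, 3,
          9, 12, 5, 6, 6, 8, 9, 8, 5, 7, 9, 10, 9, 5, 4]

n = len(scores)            # sample size
total = sum(scores)        # sum of the scores
sample_mean = total / n    # Equation 1.1

print(n, total, sample_mean)
```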


Before the test statistic can be computed, it is necessary to compute a value that is referred
to as the standard error of the population mean. This value, which is represented by the
notation σ_X̄, is computed with Equation 1.2. A full explanation of what σ_X̄ represents can be
found in Section VII.

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$  (Equation 1.2)

Substituting the values σ = 2 and n = 30 in Equation 1.2, the value σ_X̄ = .36 is
computed.

$$\sigma_{\bar{X}} = \frac{2}{\sqrt{30}} = .36$$

It should be noted that σ_X̄ can never be a negative value. If a negative value is obtained for
σ_X̄, it indicates a computational error has been made.

Equation 1.3 is employed to compute the value of z, which is the test statistic for the
single-sample z test. Note that in Equation 1.3, the value that represents μ is the value μ = 8 which is
stated in the null hypothesis.

$$z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$$  (Equation 1.3)

Employing Equation 1.3, the value z = -1.67 is computed for Example 1.1.

$$z = \frac{7.4 - 8}{.36} = -1.67$$
Note that Equation 1.3 will always yield a positive z value when the sample mean is greater
than the hypothesized value of μ. The value of z will always be negative when the sample mean
is less than the hypothesized value of μ. When the sample mean is equal to the hypothesized
value of μ, z will equal zero.
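The two computations can be sketched in Python as follows. The function names are hypothetical. The text carries the standard error as .36, so the sketch uses that reported value to reproduce z = -1.67 and, for comparison, also evaluates Equation 1.3 with the unrounded standard error.

```python
import math

def standard_error(sigma, n):
    """Equation 1.2: standard error of the population mean."""
    return sigma / math.sqrt(n)

def z_statistic(sample_mean, mu, se):
    """Equation 1.3: (sample mean - hypothesized mean) / standard error."""
    return (sample_mean - mu) / se

se_exact = standard_error(2, 30)         # about .365 before any rounding
se_text = 0.36                           # value reported in the text
z_text = z_statistic(7.4, 8, se_text)    # reproduces the text's -1.67
z_exact = z_statistic(7.4, 8, se_exact)  # about -1.64 when nothing is rounded

print(round(se_exact, 4), round(z_text, 2), round(z_exact, 2))
```

Because the two results fall on either side of the tabled one-tailed .05 value of 1.65, the outcome of this particular example is sensitive to how the standard error is rounded.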


V. Interpretation of the Test Results 
The obtained value z = -1.67 is evaluated with Table A1 (Table of the Normal Distribution)
in the Appendix. Table 1.1 summarizes the tabled critical two-tailed and one-tailed .05 and .01 z
values listed in Table A1.


Table 1.1 Tabled Critical Two-Tailed and One-Tailed .05 and .01 z Values

                        z.05     z.01
Two-tailed values       1.96     2.58
One-tailed values       1.65     2.33


The following guidelines are employed in evaluating the null hypothesis for the single- 
sample z test. 

a) If the alternative hypothesis employed is nondirectional, the null hypothesis can be re- 
jected if the obtained absolute value of z is equal to or greater than the tabled critical two-tailed 
value at the prespecified level of significance. 

b) If the alternative hypothesis employed is directional and predicts a population mean 
larger than the value stated in the null hypothesis, the null hypothesis can be rejected if the sign 
of z is positive, and the value of z is equal to or greater than the tabled critical one-tailed value 
at the prespecified level of significance. 

c) If the alternative hypothesis employed is directional and predicts a population mean 
smaller than the value stated in the null hypothesis, the null hypothesis can be rejected if the sign 
of z is negative, and the absolute value of z is equal to or greater than the tabled critical one-tailed 
value at the prespecified level of significance. 
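Guidelines a) through c) can be expressed as a small decision function. The sketch below is illustrative only: the function name and the labels "two-sided", "greater", and "less" are assumptions, and the critical values are the tabled ones from Table 1.1.

```python
# Tabled critical values from Table 1.1, keyed by (tails, alpha).
CRITICAL_Z = {
    ("two", 0.05): 1.96, ("two", 0.01): 2.58,
    ("one", 0.05): 1.65, ("one", 0.01): 2.33,
}

def reject_null(z, alternative, alpha):
    """Return True if the null hypothesis can be rejected.

    alternative: "two-sided", "greater", or "less" (hypothetical labels).
    """
    if alternative == "two-sided":          # guideline a
        return abs(z) >= CRITICAL_Z[("two", alpha)]
    critical = CRITICAL_Z[("one", alpha)]
    if alternative == "greater":            # guideline b
        return z > 0 and z >= critical
    if alternative == "less":               # guideline c
        return z < 0 and abs(z) >= critical
    raise ValueError(f"unknown alternative: {alternative!r}")

z = -1.67  # value obtained for Example 1.1
for alt in ("two-sided", "greater", "less"):
    for alpha in (0.05, 0.01):
        print(alt, alpha, reject_null(z, alt, alpha))
```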

Employing the above guidelines, we can only reject the null hypothesis if the directional
alternative hypothesis H₁: μ < 8 is employed, and the null hypothesis can only be rejected at
the .05 level. This is the case, since the obtained value of z is a negative number, and the
absolute value of z is greater than the tabled critical one-tailed .05 value z.05 = 1.65. The
alternative hypothesis H₁: μ < 8 is not supported at the .01 level, since the absolute value z =
1.67 is not greater than the tabled critical one-tailed .01 value z.01 = 2.33.

The nondirectional alternative hypothesis H₁: μ ≠ 8 is not supported, since the obtained
absolute value z = 1.67 is less than the tabled critical two-tailed .05 value z.05 = 1.96.

The directional alternative hypothesis H₁: μ > 8 is not supported, since the obtained
value z = -1.67 is a negative number. In order for the alternative hypothesis H₁: μ > 8 to be
supported, the computed value of z must be a positive number (as well as the fact that it must
be equal to or greater than the tabled critical one-tailed value at the prespecified level of
significance).

A summary of the analysis of Example 1.1 with the single-sample z test follows: With
respect to the test of visual-motor coordination, we can conclude that the sample of 30 subjects
comes from a population with a mean value other than 8 only if we employ the directional
alternative hypothesis H₁: μ < 8, and prespecify as our level of significance α = .05. This
result can be summarized as follows: z = -1.67, p < .05.

A more in-depth discussion of the interpretation of the z value computed with the single- 
sample z test is contained in Section VII. 


VI. Additional Analytical Procedures for the Single-Sample z Test 
and/or Related Tests 


Procedures are available for computing power and confidence intervals for the single-sample
z test. These computations are discussed in Section VI of the single-sample t test (which
employs the same protocol for such computations as does the single-sample z test).


VII. Additional Discussion of the Single-Sample z Test 


1. The interpretation of a negative z value: The actual range of scores on the abscissa (i.e.,
the X-axis) of the standard normal distribution is -∞ < z < +∞. The guidelines outlined in Section
V for interpreting negative z values are intended to provide the reader with the simplest and least
confusing protocol for interpreting such values. In terms of the actual distribution of z values,
it should be noted that although the tabled critical z values listed in Table 1.1 are positive
numbers, they are also applicable to interpreting negative z values. Since the critical values
recorded in Table 1.1 represent absolute values, the corresponding negative z values are listed
in Table 1.2.


Table 1.2 Tabled Critical Two-Tailed and One-Tailed .05 and .01 Negative z Values

                        z.05     z.01
Two-tailed values      -1.96    -2.58
One-tailed values      -1.65    -2.33


Within the framework of the values noted in Table 1.2, if one employs the directional
(one-tailed) alternative hypothesis H₁: μ < 8, in order to reject the null hypothesis, the obtained value
of z must be a negative number that is equal to or less than the prespecified tabled critical
value. Thus, to be significant at the .05 level, the obtained z value would have to be equal to or
less than z = -1.65. The reader should take note of the fact that any negative number which has
an absolute value greater than 1.65 is less than -1.65. In the same respect, in order for the
alternative hypothesis H₁: μ < 8 to be supported at the .01 level, the obtained z value would
have to be equal to or less than z = -2.33, since any negative number which has an absolute value
greater than 2.33 is less than -2.33. The important thing for the reader to understand is that when
one is dealing with a negative number, the larger the absolute value of the negative number, the
lower the value of that number.


2. The standard error of the population mean and graphical representation of the results
of the single-sample z test: The intent of this section is to provide further clarification of what
the z value computed with the single-sample z test represents. In order to do this, it is necessary
to understand what is represented by the standard error of the population mean (σ_X̄), which
is the denominator of Equation 1.3. The standard error of the population mean represents a
standard deviation of a sampling distribution of means. Although such a sampling distribution
is theoretical and is based on an infinite number of samples, it is possible to construct an
empirical sampling distribution that is based on a smaller number of sample means. In order to
construct such a sampling distribution of means, a random sample consisting of n subjects is
drawn from a population of N subjects. Upon doing this, the mean of the sample of n subjects
is computed. Once again, employing the whole population of N subjects, a second random
sample consisting of n subjects is selected, and the mean of that sample is computed. This
process is repeated over and over again. At whatever point one decides to terminate the process,
a large number of sample means will have been computed, each of which is based on a sample
size of n subjects. The frequency distribution of these sample means (which will be distributed
normally) is known as a sampling distribution of means. The mean of a sampling distribution
(represented by the notation μ_X̄) that is based on an infinite number of sample means will be the
same value as the population mean (μ). As the number of sample means used to construct a
sampling distribution increases, so does the likelihood that the computed value of the mean
of the sampling distribution equals the value of μ. The standard deviation of a sampling
distribution (i.e., the standard deviation of all of the sample means) is the standard error of the
population mean (σ_X̄), which in many sources is referred to as the standard error of the
mean.
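The construction of an empirical sampling distribution described above can be simulated. The following is a minimal sketch under stated assumptions: scores are drawn from an idealized normal population with μ = 8 and σ = 2 (rather than from a finite population of N subjects), the number of samples is an arbitrary choice, and all names are illustrative.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

MU, SIGMA = 8, 2     # population parameters from Example 1.1
N_PER_SAMPLE = 30    # size of each sample
NUM_SAMPLES = 5000   # number of sample means (arbitrary choice)

sample_means = []
for _ in range(NUM_SAMPLES):
    # Draw one random sample and record its mean.
    sample = [random.gauss(MU, SIGMA) for _ in range(N_PER_SAMPLE)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)   # estimates the population mean
sd_of_means = statistics.pstdev(sample_means)    # estimates the standard error

print(round(mean_of_means, 2), round(sd_of_means, 2))
```

With a large number of samples, the first printed value should lie close to 8 and the second close to 2/√30.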

The z value computed with Equation 1.3 represents the number of standard deviation units
(based on the value of σ_X̄) that the sample mean deviates from the hypothesized population
mean. Thus in Example 1.1, the value σ_X̄ = .36 represents the standard deviation of a sampling
distribution of means in which in each sample n = 30. The obtained value z = -1.67 indicates that
X̄ = 7.4 (the sample mean) is 1.67 sampling distribution standard deviation units below the
hypothesized population mean μ = 8 (which as noted earlier has the same value as μ_X̄). The
difference is statistically significant, since a sample mean of 7.4 obtained with 30 subjects is a
relatively unlikely occurrence in a sampling distribution that has a mean of 8 and a standard
deviation of .36. If we make the assumption that the distribution of means in the sampling
distribution is normal, use of the single-sample z test will lead to the conclusion that if, in fact, the
true value of the population mean is 8, the likelihood of obtaining a sample mean equal to or less
than 7.4 is less than .05 (to be exact, it equals .0475).
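The exact probability quoted above is the area of the standard normal distribution that lies at or below z = -1.67. As a quick check, Python's standard library can evaluate that area; the variable names are illustrative.

```python
from statistics import NormalDist

z = -1.67                           # obtained test statistic
p_one_tailed = NormalDist().cdf(z)  # area at or below z under the standard normal
p_two_tailed = 2 * p_one_tailed     # both tails, for a nondirectional hypothesis

print(round(p_one_tailed, 4), round(p_two_tailed, 4))
```

The one-tailed area falls below .05 while the two-tailed area does not, which matches the decisions reached in Section V.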

Figure 1.1 provides a visual description of the sampling distribution of means for Example
1.1. In Figure 1.1, the numbers 6.92, 7.28, ..., 9.08 along the abscissa identify the values a
sample mean will assume if it is 1, 2, and 3 standard deviation (sd) units below and above the
mean of the sampling distribution. Since the value of one standard deviation unit for the
sampling distribution under discussion is equal to σ_X̄ = .36, the value 8.36, which is one
standard deviation unit above the mean of the distribution, is obtained simply by adding the value
σ_X̄ = .36 to μ_X̄ = 8. The value 8.72, which is two standard deviations above the mean, is
obtained by adding two times the value of σ_X̄ to μ_X̄ = 8, and so on.
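The abscissa values can be generated directly from the mean of the sampling distribution and the standard error. A short sketch, using the standard error of .36 as reported in the text and illustrative names:

```python
MU_XBAR = 8   # mean of the sampling distribution
SE = 0.36     # standard error as reported in the text

# Values of the sample mean at -3, -2, ..., +3 standard deviation units.
tick_values = [round(MU_XBAR + k * SE, 2) for k in range(-3, 4)]
print(tick_values)
```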
Figure 1.1 Sampling Distribution for Example 1.1


Since almost 100% of the cases in a normal distribution fall within three standard deviation
units above or below the mean, if a researcher has a sample of 30 subjects that is derived from
a population in which μ = 8 and σ = 2, he can be almost 100% sure that the mean of the sample
will fall between the values 6.92 and 9.08 (which are the values that correspond to the -3 sd and
+3 sd points in Figure 1.1). A mean value outside of this range is highly unlikely to occur and,
if a more extreme mean value does occur, it is reasonable to conclude that the sample is derived
from a population which has a mean value other than 8.

Earlier in the discussion of a sampling distribution it was noted that it is assumed that the
sample means are normally distributed. The basis for this statement is a general principle in
mathematics known as the central limit theorem. The central limit theorem states that in a
population with a mean value of μ and a standard deviation of σ, the following will be true with
respect to the sampling distribution of means: a) The sampling distribution will have a mean
value equal to μ and a standard deviation equal to σ_X̄ = σ/√n; and b) The sampling distribution
approaches being normal as the size (n) of each of the samples employed in generating the
sampling distribution increases, and as the total number of means used to generate the sampling
distribution increases. Although the underlying population each of the samples is derived from
does not, in itself, have to be normal, the more it approximates normality the lower the value of
n required for the sampling distribution to be normal. In addition, the more the underlying
population each of the samples is derived from approaches normality, the fewer sample means will
be required before the sampling distribution becomes normal.
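The behavior described by the central limit theorem can be illustrated with a population that is clearly not normal. The sketch below is an assumption-laden illustration, not part of the original text: it draws from an exponential population (for which μ = 1 and σ = 1), compares sample sizes n = 2 and n = 30, and uses a simple moment-based skewness as a rough gauge of departure from normality. All names are hypothetical.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

def sample_means(n, num_samples=10000):
    """Means of num_samples samples of size n from an exponential population."""
    return [statistics.fmean(random.expovariate(1.0) for _ in range(n))
            for _ in range(num_samples)]

def skewness(values):
    """Moment-based skewness: mean cubed deviation over the cubed sd."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean((v - m) ** 3 for v in values) / sd ** 3

for n in (2, 30):
    means = sample_means(n)
    print(n,
          round(statistics.fmean(means), 2),    # close to mu = 1
          round(statistics.pstdev(means), 2),   # close to sigma / sqrt(n)
          round(skewness(means), 2))            # shrinks toward 0 as n grows
```

For the larger sample size the standard deviation of the means should be near 1/√30 and the skewness noticeably closer to zero.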

Based on what has been said with respect to the standard error of the population mean, one
can determine the value that a sample mean will have to be equal to or more extreme than in
order to reject a null hypothesis at a prespecified level of significance. Figure 1.2 depicts these
values for Example 1.1 in reference to the tabled critical one- and two-tailed .05 and .01 z values.
Note that in each of the graphs, the value that is written directly below the tabled critical z value
for the relevant level of significance is the value the sample mean will have to be equal to or
more extreme than in order to reject the null hypothesis H₀: μ = 8.

The values that the sample mean must be equal to or more extreme than in Figure 1.2 are
computed with Equation 1.4, which is the result of algebraically transposing the terms in
Equation 1.3 in order to solve for the value of X̄.

$$\bar{X} = \mu \pm z\,\sigma_{\bar{X}}$$  (Equation 1.4)
Figure 1.2a Distribution of Critical Two-Tailed .05 z Value for H₁: μ ≠ 8
Figure 1.2b Distribution of Critical Two-Tailed .01 z Value for H₁: μ ≠ 8
Figure 1.2c Distribution of Critical One-Tailed .05 z Value for H₁: μ > 8
Figure 1.2f Distribution of Critical One-Tailed .01 z Value for H₁: μ < 8




The value employed to represent z in Equation 1.4 is the relevant tabled critical z value at
the prespecified level of significance. By multiplying the latter value by σ_X̄ and then adding the
product to, and subtracting it from, the value of the population mean, one is able to compute the upper
limit that the sample mean must be equal to or greater than and the lower limit that the sample
mean must be equal to or less than in order for a result to be significant. This is illustrated below
for the case depicted in Figure 1.2a, which describes the upper and lower limits for the sample
mean when the nondirectional alternative hypothesis H₁: μ ≠ 8 is employed, with α = .05.

$$\bar{X} = 8 \pm (1.96)(.36) = 8 \pm .71$$


Since 8 + .71 = 8.71 and 8 - .71 = 7.29, in order to be significant at the .05 level, a sample
mean will have to be equal to or greater than 8.71, or equal to or less than 7.29. This result can
be summarized as follows: X̄ ≤ 7.29 or X̄ ≥ 8.71.

A summary of the results depicted in Figure 1.2 follows: 

Figure 1.2a: If the nondirectional alternative hypothesis H₁: μ ≠ 8 is employed, with
α = .05, in order to reject the null hypothesis, the obtained value of the sample mean will have to
be equal to or greater than 8.71 or be equal to or less than 7.29.

Figure 1.2b: If the nondirectional alternative hypothesis H₁: μ ≠ 8 is employed, with
α = .01, in order to reject the null hypothesis, the obtained value of the sample mean will have
to be equal to or greater than 8.93 or be equal to or less than 7.07.

Figure 1.2c: If the directional alternative hypothesis H₁: μ > 8 is employed, with α = .05,
in order to reject the null hypothesis, the obtained value of the sample mean will have to be equal
to or greater than 8.59.

Figure 1.2d: If the directional alternative hypothesis H₁: μ > 8 is employed, with α = .01,
in order to reject the null hypothesis, the obtained value of the sample mean will have to be equal
to or greater than 8.84.

Figure 1.2e: If the directional alternative hypothesis H₁: μ < 8 is employed, with α = .05,
in order to reject the null hypothesis, the obtained value of the sample mean will have to be equal
to or less than 7.41.

Figure 1.2f: If the directional alternative hypothesis H₁: μ < 8 is employed, with α = .01,
in order to reject the null hypothesis, the obtained value of the sample mean will have to be equal
to or less than 7.16.
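The limits summarized above all follow from Equation 1.4. A brief sketch, assuming the standard error of .36 reported in the text and the tabled critical values; the function name is illustrative:

```python
MU = 8      # mean stated in the null hypothesis
SE = 0.36   # standard error as reported in the text

def critical_means(z_critical, mu=MU, se=SE):
    """Equation 1.4: return the (lower, upper) limits mu -/+ z * se, rounded."""
    offset = z_critical * se
    return round(mu - offset, 2), round(mu + offset, 2)

print(critical_means(1.96))  # two-tailed .05 (Figure 1.2a)
print(critical_means(2.58))  # two-tailed .01 (Figure 1.2b)
print(critical_means(1.65))  # one-tailed .05: upper for 1.2c, lower for 1.2e
print(critical_means(2.33))  # one-tailed .01: upper for 1.2d, lower for 1.2f
```

For a directional hypothesis only one member of each returned pair applies: the upper limit when the alternative predicts a larger mean, the lower limit when it predicts a smaller one.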

Note that with respect to a specific alternative hypothesis in the above examples, the lower 
the value of alpha, the larger the value computed for an upper limit and the lower the value 
computed for a lower limit. Additionally, if the value of alpha is fixed, the computed value for 
an upper limit will be higher and the computed value for a lower limit will be lower when a 
nondirectional alternative hypothesis is employed, as opposed to when a directional alternative 
hypothesis is used. 


3. Additional examples illustrating the interpretation of a computed z value: To further
clarify the interpretation of z values, Table 1.3 lists three additional z values that could have been
obtained for Example 1.1 if a different set of data had been employed. Table 1.3 notes the
decisions that would be made with reference to the three possible alternative hypotheses a researcher
could employ on the basis of each of these z values. The table assumes that H₀: μ = 8.


4. The z test for a population proportion: Another test that employs the normal distribution
in order to analyze data derived from a single sample is the z test for a population proportion
(Test 9a). Equation 9.6, the equation for computing the test statistic for the z test for a
population proportion, is a special case of Equation 1.3. The use of Equation 9.6 is reserved for
evaluating a set of scores for a binomially distributed variable (for which the values of μ and σ
can be determined). The z test for a population proportion is discussed after a full discussion
of the binomial distribution (which can be found under the binomial sign test for a single
sample (Test 9)).


Table 1.3 Decision Table for z Values

Obtained z value: 1.75
    H₁: μ ≠ 8   The null hypothesis cannot be rejected, since the obtained value z = 1.75 is less
                than the tabled critical two-tailed .05 and .01 values z.05 = 1.96 and
                z.01 = 2.58.
    H₁: μ > 8   The null hypothesis can be rejected at the .05 level of significance, since the
                obtained value z = 1.75 is a positive number which is greater than the tabled
                critical one-tailed .05 value z.05 = 1.65. The null hypothesis cannot be rejected
                at the .01 level, since it is less than the tabled critical one-tailed .01 value
                z.01 = 2.33.
    H₁: μ < 8   The null hypothesis cannot be rejected, since the obtained value z = 1.75 is a
                positive number.

Obtained z value: -2.75
    H₁: μ ≠ 8   The null hypothesis can be rejected at both the .05 and .01 levels of significance,
                since the obtained absolute value z = 2.75 is greater than the tabled critical
                two-tailed .05 and .01 values z.05 = 1.96 and z.01 = 2.58.
    H₁: μ > 8   The null hypothesis cannot be rejected, since the obtained value z = -2.75 is
                a negative number.
    H₁: μ < 8   The null hypothesis can be rejected at both the .05 and .01 levels of significance,
                since the obtained value z = -2.75 is a negative number and the absolute value
                z = 2.75 is greater than the tabled critical one-tailed .05 and .01 values
                z.05 = 1.65 and z.01 = 2.33.

Obtained z value: .75
    H₁: μ ≠ 8   The null hypothesis cannot be rejected, since the obtained value z = .75 is less
                than the tabled critical two-tailed .05 and .01 values z.05 = 1.96 and
                z.01 = 2.58.
    H₁: μ > 8   The null hypothesis cannot be rejected, since the obtained value z = .75 is less
                than the tabled critical one-tailed .05 and .01 values z.05 = 1.65 and
                z.01 = 2.33.
    H₁: μ < 8   The null hypothesis cannot be rejected, since the obtained value z = .75 is a
                positive number.


VIII. Additional Examples Illustrating the Use of the Single- 
Sample z Test 


Five additional examples that can be evaluated with the single-sample z test are presented in this 
section. Since Examples 1.2-1.4 employ the same population parameters and data set used in 
Example 1.1, they yield the identical result. Note that Examples 1.2 and 1.3 employ objects in 
lieu of subjects. Examples 1.5 and 1.6 illustrate the application of the single-sample z test when 
the size of the sample is n = 1. 


Example 1.2. The Brite battery company manufactures batteries which are programmed by a 
computer to have an average life span of 8 months and a standard deviation of 2 months. If the 
average life of a random sample of 30 Brite batteries purchased from 30 different stores is 7.4 
months, are the data consistent with the mean value parameter programmed into the computer? 


Example 1.3. The Smooth Road cement company stores large quantities of its cement in 30 
storage tanks. A state law says that the machine which fills the tanks must be calibrated so as 
not to deviate substantially from a mean load of 8 tons. It is known that the standard deviation 
of the loads delivered by the machine is 2 tons. An inspector visits the storage facility and 
determines that the mean number of tons in the 30 storage tanks is 7.4 tons. Does this conform 
to the requirements of the state law? 


Example 1.4. A study involving 30 subjects is conducted in order to determine the subjects’ 
ability to accurately judge weight. In the study subjects are required (by adding or subtracting 
sand) to adjust the weight of a cylinder, referred to as the variable stimulus, until it is judged 
equal in weight to a standard comparison cylinder whose weight is fixed. The weight of the 
standard comparison stimulus is 8 ounces. Prior research has indicated that the standard devi- 
ation of the subjects’ judgements in such a task is 2 ounces. Prior to testing the subjects, the 
experimenter decides she will conclude that a kinesthetic illusion occurs if the mean of the 
subjects' judgements differs significantly from the value of the standard stimulus. If the average 
weight assigned by the 30 subjects to the variable stimulus is 7.4 ounces, can the experimenter 
conclude that a kinesthetic illusion has occurred? 
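
Since Examples 1.2-1.4 share the values employed in Example 1.1 (a population mean of 8, a population standard deviation of 2, a sample size of 30, and a sample mean of 7.4), a single computation covers all three. The following Python sketch is offered only as an illustration. The function name `single_sample_z` is an assumption introduced here, and because the sketch does not round the standard error before dividing, its result may differ slightly in the second decimal place from a hand computation.

```python
import math

def single_sample_z(sample_mean, mu, sigma, n):
    """Return (standard error, z) for the single-sample z test."""
    standard_error = sigma / math.sqrt(n)      # sigma divided by the square root of n
    z = (sample_mean - mu) / standard_error    # distance of the sample mean from mu in standard error units
    return standard_error, z

# Values shared by Examples 1.2-1.4
se, z = single_sample_z(sample_mean=7.4, mu=8, sigma=2, n=30)
print(f"standard error = {se:.3f}, z = {z:.3f}")
```

The sign of z is negative because the sample mean falls below the hypothesized population mean, so only a nondirectional alternative hypothesis or the directional alternative hypothesis predicting a smaller mean could be supported.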


Example 1.5. A meteorologist determines that during the current year there were 80 major 
storms recorded in the Western Hemisphere. He claims that 80 storms represent a significantly 
greater number than the annual average. Based on data that have been accumulated over the 
past 100 years, it is known that on the average there have been 70 major storms in the Western 
Hemisphere, and the standard deviation is 2. (We will assume the distribution for the number 
of storms per year is normal.) Do 80 storms represent a significant deviation from the mean 
value of 70? 


Example 1.5 illustrates a problem that would be evaluated with the single-sample z test in 
which the value of n is equal to one. In the example, the sample size of one represents the single 
year during which there were 80 storms. When the value of n = 1, Equation 1.3 becomes Equa- 
tion 1.5 (which is the same as Equation I.27 in the Introduction). 





$$z = \frac{X - \mu}{\sigma} \qquad \text{(Equation 1.5)}$$


Equation 1.5, the equation for converting a raw score into a standard deviation (z) score, 
allows one to determine the likelihood of a specific score occurring within a normally distributed 
population. Thus, within the context of Example 1.5, Equation 1.5 will allow the meteorologist 
to determine the likelihood that a score of 80 will occur in a normally distributed population in 
which $\mu = 70$ and $\sigma = 2$. The analysis assumes that within the total population there are N = 100 
scores (where each score represents the number of storms in a given year during the 100-year 
period). Since the frequency of storms during one year is being compared to the population 
mean, the value of n = 1. Note that when n = 1, the value of the sample mean ($\bar{X}$) in Equation 
1.3 reduces to the value X in Equation 1.5. Additionally, since n = 1, the value of $\sigma_{\bar{X}}$ in Equation 
1.3 becomes $\sigma$ in Equation 1.5, since $\sigma_{\bar{X}} = \sigma/\sqrt{n} = \sigma/\sqrt{1} = \sigma$. 

Employing Equations 1.3/1.5 with the data for Example 1.5, the value z = 5 is computed. 

$$z = \frac{80 - 70}{2} = 5$$

The null hypothesis employed for the above analysis is $H_0: \mu = 70$. Example 1.5 implies 
that either the directional alternative hypothesis $H_1: \mu > 70$ or the nondirectional alternative 
hypothesis $H_1: \mu \neq 70$ can be employed. Regardless of which of these alternative hypotheses 
is employed, since the computed value z = 5 is greater than all of the tabled critical values in 
Table 1.1, the null hypothesis can be rejected at both the .05 and .01 levels. Thus, the 
meteorologist can conclude that a significantly greater number of storms were recorded during the 
current year than the mean value recorded for the past 100 years. The directional alternative 
hypothesis $H_1: \mu < 70$ is not supported, since, in order to support the latter alternative 
hypothesis, the computed value of z must be a negative number (which will only be the case if 
the number of storms observed during the year is less than the population mean of $\mu = 70$). 
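
The conversion of a single raw score into a z score, and the proportion of a normal distribution that falls at or above that score, can be sketched in Python as shown below. This is a minimal illustration rather than a substitute for Table A1; the function names `z_score` and `upper_tail` are assumptions introduced here, and the tail area is obtained from the complementary error function in the standard library.

```python
import math

def z_score(x, mu, sigma):
    """Equation 1.5: convert a raw score into a z score (sample size of one)."""
    return (x - mu) / sigma

def upper_tail(z):
    """Proportion of the standard normal distribution that falls at or above z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Example 1.5: 80 storms, population mean of 70, standard deviation of 2
z = z_score(80, 70, 2)
p = upper_tail(z)
print(f"z = {z}, upper-tail proportion = {p:.2e}")
```

The resulting proportion is far smaller than .01, which is consistent with rejecting the null hypothesis at both the .05 and .01 levels.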


Example 1.6. A physician assesses the level of a specific chemical in a 40-year-old male 
patient's blood. Assume that the average level of the chemical in adult males is 70 milligrams 
(per 100 milliliters), with a standard deviation of 2 milligrams (per 100 milliliters). If the patient 
has a blood reading of 80, will the patient be viewed as abnormal? 


As is the case in Example 1.5, Example 1.6 also employs a sample size of n = 1. Since this 
example uses the same data as Example 1.5, the computed value z = 5 is obtained, thus allowing 
the physician to reject the null hypothesis. The value z = 5 indicates that the patient has a blood 
reading that is five standard deviation units above the population mean. The proportion of cases 
in the normal distribution associated with a z score of 5 or greater is so small that there is no 
tabled value listed for z = 5 in Table A1. 


Reference 


Freund, J. E. (1984). Modern elementary statistics (6th ed.). Englewood Cliffs, NJ: Prentice- 
Hall, Inc. 


Endnotes 


1. The exact probability value recorded for z = 1.67 in Column 3 of Table A1 is .0475 (which 
is equivalent to 4.75%). This indicates that the proportion of cases that falls above the value 
z = 1.67 is .0475, and the proportion of cases that falls below the value z = -1.67 is .0475. 
Since this indicates that in the left tail of the distribution there is less than a 5% chance of 
obtaining a z value equal to or less than z = -1.67, we can reject the null hypothesis at the .05 
level if we employ the directional alternative hypothesis $H_1: \mu < 8$, with $\alpha = .05$. 


2. Equation 1.2 is employed to compute the standard error of the population mean when the size 
of the underlying population is infinite. In practice, it is employed when the size of the 
underlying population is large and the size of the sample is believed to constitute less than 
5% of the population. However, among others, Freund (1984) notes that in a finite 
population, if the size of a sample constitutes more than 5% of the population, a correction 
factor is introduced into Equation 1.2. The computation of the standard error of the mean 
with the finite population correction factor is noted below: 


$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$$


Where: N represents the total number of subjects/objects that comprise the population. 





The finite population corrected equation will result in a smaller value for $\sigma_{\bar{X}}$. This is the 
case, since as the proportion of a population represented by a sample increases, the less 
variability there will be among the means that comprise the sampling distribution, and thus 
the smaller the expected difference between the sample mean obtained for a set of data and 
the value of the population mean. Thus when n = N, employing the finite corrected equation, 
the value of $\sigma_{\bar{X}}$ will always equal zero. This is the case since when n = N, the sample mean 
and population mean will always be the same value, and thus no error is involved in 
estimating the value of $\mu$. Since it is usually assumed that the size of a sample is less than 
5% of the population it represents, Equation 1.2 is the only equation listed in most sources 
for computing the value of $\sigma_{\bar{X}}$. 
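
The effect of the finite population correction factor can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `standard_error` and the population size N = 100 are hypothetical values introduced here, while the standard deviation of 2 is the value employed in the examples for the single-sample z test.

```python
import math

def standard_error(sigma, n, N=None):
    """Standard error of the mean; applies the finite population correction when N is given."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))   # finite population correction factor
    return se

sigma = 2     # population standard deviation
N = 100       # hypothetical size of a finite population
for n in (5, 30, 100):
    uncorrected = standard_error(sigma, n)
    corrected = standard_error(sigma, n, N)
    print(f"n = {n:3d}: uncorrected = {uncorrected:.4f}, corrected = {corrected:.4f}")
```

The corrected value never exceeds the uncorrected value, and it equals zero when n = N.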


3. Inspection of Table A1 reveals that the exact percentage of cases in a normal distribution that 
falls within three standard deviations above or below the mean is 99.74%. 


Test 2 


The Single-Sample t Test 
(Parametric Test Employed with Interval/Ratio Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Does a sample of n subjects (or objects) come from a 
population in which the mean ($\mu$) equals a specified value? 


Relevant background information on test The single-sample t test is one of a number of 
inferential statistical tests that are based on the t distribution. Like the normal distribution, the 
t distribution is a bell-shaped, continuous, symmetrical distribution, which to the statistically 
unsophisticated eye is almost indistinguishable from the normal distribution. The value t, which is 
the computed test statistic for the single-sample t test, represents a standard deviation score, and is 
interpreted in the same manner as the z value computed for the single-sample z test. The only 
difference between a z value and a t value is that for a given standard deviation score, the 
proportion of cases that falls between the mean and the standard deviation score will be a function 
of which of the two distributions one employs. Except when $n = \infty$, for a given standard 
deviation score, a larger proportion of cases falls between the mean of the normal distribution 
and the standard deviation score than the proportion of cases that falls between the mean and that 
same standard deviation score in the t distribution. In point of fact, there are actually an infinite 
number of t distributions, each distribution being based on the number of subjects/objects in 
the sample. As the size of the sample increases, the proportions (and consequently the critical 
values) in the t distribution approach the proportions (and critical values) in the normal 
distribution, and, in fact, when $n = \infty$, the normal and t distributions are identical. A more 
detailed discussion (as well as visual illustrations) of the t distribution can be found in Section 
VII. 

The single-sample t test is employed in a hypothesis testing situation involving a single 
sample in order to determine whether or not a sample with a mean of $\bar{X}$ is derived from a 
population with a mean of $\mu$. If the result of the single-sample t test yields a significant difference, 
the researcher can conclude there is a high likelihood the sample is derived from a population 
with a mean value other than $\mu$. 

The single-sample t test is used with interval/ratio data. The test is employed when a 
researcher does not know the value of the population standard deviation ($\sigma$), and therefore must 
estimate it by computing the sample standard deviation ($\tilde{s}$). As is noted in the discussion of the 
single-sample z test (Test 1), some sources argue that even if one knows the value of $\sigma$, when 
the sample size is very small (generally less than 25), the single-sample t test provides a more 
accurate estimate of the underlying sampling distribution for the data. 

The following two assumptions, which are noted for the single-sample z test, also apply to 
the single-sample t test: a) The sample has been randomly selected from the population it 
represents; and b) The distribution of data in the underlying population the sample represents is 
normal. If either of the aforementioned assumptions is saliently violated, the reliability of the 
t test statistic may be compromised. 


II. Example 


Example 2.1 A physician states that the average number of times he sees each of his patients 
during the year is five. In order to evaluate the validity of this statement, he randomly selects 
ten of his patients and determines the number of office visits each of them made during the past 
year. He obtains the following values for the ten patients in his sample: 9, 10, 8, 4, 8, 3, 0, 10, 
15, 9. Do the data support his contention that the average number of times he sees a patient is 
five? 


III. Null versus Alternative Hypotheses 


Null hypothesis $H_0: \mu = 5$ 


(The mean of the population the sample represents equals 5.) 


Alternative hypothesis $H_1: \mu \neq 5$ 


(The mean of the population the sample represents does not equal 5. This is a nondirectional 
alternative hypothesis, and it is evaluated with a two-tailed test. In order to be supported, the 
absolute value of t must be equal to or greater than the tabled critical two-tailed t value at the 
prespecified level of significance. Thus, either a significant positive t value or a significant 
negative t value will provide support for this alternative hypothesis.) 


or 
$H_1: \mu > 5$ 


(The mean of the population the sample represents is greater than 5. This is a directional 
alternative hypothesis, and it is evaluated with a one-tailed test. It will only be supported if 
the sign of t is positive, and the absolute value of t is equal to or greater than the tabled critical 
one-tailed t value at the prespecified level of significance.) 


or 
$H_1: \mu < 5$ 


(The mean of the population the sample represents is less than 5. This is a directional alter- 
native hypothesis, and it is evaluated with a one-tailed test. It will only be supported if the sign 
of t is negative, and the absolute value of t is equal to or greater than the tabled critical one-tailed 
t value at the prespecified level of significance.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 


Table 2.1 summarizes the number of visits recorded for the n = 10 subjects in Example 2.1. In 
order to compute the test statistic for the single-sample t test, it is necessary to determine the 
mean of the sample and to obtain an unbiased estimate of the population standard deviation. 

Employing Equation 1.1 (which is the same as Equation I.1 in the Introduction) the mean 
of the sample is computed to be $\bar{X} = 7.6$. 


Table 2.1 Summary of Data for Example 2.1 


Patient       X        $X^2$
   1          9         81 
   2         10        100 
   3          8         64 
   4          4         16 
   5          8         64 
   6          3          9 
   7          0          0 
   8         10        100 
   9         15        225 
  10          9         81 
        $\Sigma X = 76$   $\Sigma X^2 = 740$ 


Equation 2.1 (which is the same as Equation I.8 in the Introduction) is employed to 
compute $\tilde{s}$, which represents an unbiased estimate of the value of the population standard deviation. 


$$\tilde{s} = \sqrt{\frac{\Sigma X^2 - \frac{(\Sigma X)^2}{n}}{n - 1}} \qquad \text{(Equation 2.1)}$$





As is the case with the single-sample z test, computation of the test statistic for the 
single-sample t test requires that the value of the standard error of the population mean be 
computed. $s_{\bar{X}}$, which represents an estimate of $\sigma_{\bar{X}}$, is computed with Equation 2.2. Note that 
$s_{\bar{X}}$, which is referred to as the estimated standard error of the population mean, is based on 
the value of $\tilde{s}$ computed with Equation 2.1. A discussion of the theoretical meaning of the 
standard error of the population mean can be found in Section VII of the single-sample z test. 
Further discussion of $s_{\bar{X}}$ (which represents a standard deviation of a sampling distribution of 
means and is interpreted in the same manner as $\sigma_{\bar{X}}$) can be found in Section V. 


$$s_{\bar{X}} = \frac{\tilde{s}}{\sqrt{n}} \qquad \text{(Equation 2.2)}$$


Employing Equation 2.2, the value $s_{\bar{X}} = 1.34$ is computed. 


$$s_{\bar{X}} = \frac{4.25}{\sqrt{10}} = 1.34$$


It should be noted that neither $\tilde{s}$ nor $s_{\bar{X}}$ can ever be a negative value. If a negative value is 
obtained for either $\tilde{s}$ or $s_{\bar{X}}$, it indicates a computational error has been made. 
Equation 2.3 is the test statistic for the single-sample t test.¹ 


$$t = \frac{\bar{X} - \mu}{s_{\bar{X}}} \qquad \text{(Equation 2.3)}$$


Inspection of Equation 2.3 reveals that it is similar in structure to Equation 1.3, the equation 
for the single-sample z test. The only differences between the two equations are that: a) 
Equation 2.3 employs the t distribution as opposed to the z distribution; and b) The value of the 
standard error of the population mean is estimated in Equation 2.3 from the value of $\tilde{s}$.² 

Employing Equation 2.3, the value t = 1.94 is computed. Note that in Equation 2.3, the 
value that represents $\mu$ is the value $\mu = 5$ which is stated in the null hypothesis. 


$$t = \frac{7.6 - 5}{1.34} = 1.94$$
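
The computations for Example 2.1 can be retraced with the following Python sketch, which applies Equations 1.1, 2.1, 2.2, and 2.3 in turn to the ten values listed in the example. It is offered only as an illustration; the variable names are assumptions introduced here, and intermediate values are not rounded, so they agree with the hand computations once rounded to two decimal places.

```python
import math

visits = [9, 10, 8, 4, 8, 3, 0, 10, 15, 9]   # data for Example 2.1
mu = 5                                        # value stated in the null hypothesis

n = len(visits)
sum_x = sum(visits)                           # sum of the X scores
sum_x2 = sum(x * x for x in visits)           # sum of the squared X scores
mean = sum_x / n                              # Equation 1.1

# Equation 2.1: unbiased estimate of the population standard deviation
s_tilde = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

# Equation 2.2: estimated standard error of the population mean
se = s_tilde / math.sqrt(n)

# Equation 2.3: test statistic for the single-sample t test
t = (mean - mu) / se

print(f"sum X = {sum_x}, sum X^2 = {sum_x2}, mean = {mean:.1f}")
print(f"s = {s_tilde:.2f}, standard error = {se:.2f}, t = {t:.2f}")
```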


V. Interpretation of the Test Results 


Like a z value, a t value represents a standard deviation score, and except for the fact that a 
different distribution is employed, it is interpreted in the same manner. The obtained value t = 
1.94 is evaluated with Table A2 (Table of Student's t Distribution) in the Appendix.³ In 
Table A2 the critical t values are listed in relation to the proportion of cases (which are recorded 
at the top of each column) that falls below a specified t score in the t distribution, and the number 
of degrees of freedom for the sampling distribution that is being evaluated (which are recorded 
in the left hand column of each row). Equation 2.4 is employed to compute the degrees of 
freedom for the single-sample t test. A full explanation of the meaning of the degrees of freedom 
can be found in Section VII. 


$$df = n - 1 \qquad \text{(Equation 2.4)}$$


Employing Equation 2.4, we compute that df = 10 - 1 = 9. Thus, the tabled critical t values 
that are employed in evaluating the results of Example 2.1 are the values recorded in the cells of 
Table A2 that fall in the row for df = 9 and the columns with probabilities/proportions that 
correspond to the one- and two-tailed .05 and .01 values. These critical t values are summarized 
in Table 2.2. 


Table 2.2 Tabled Critical Two-Tailed and One-Tailed .05 and .01 t Values 


                      $t_{.05}$     $t_{.01}$
Two-tailed values       2.26          3.25 
One-tailed values       1.83          2.82 


Note that the tabled critical two-tailed value $t_{.05} = 2.26$ is the value in the row df = 9 and 
the column p = .975, since $t_{.05} = 2.26$ is the standard deviation score above which (as well as 
below which in the case of t = -2.26) a proportion equivalent to .025 of the cases in the 
distribution falls. The tabled critical two-tailed value $t_{.01} = 3.25$ is the value in the row df = 9 and 
the column p = .995, since $t_{.01} = 3.25$ is the standard deviation score above which (as well as 
below which in the case of t = -3.25) a proportion equivalent to .005 of the cases in the 
distribution falls. The tabled critical one-tailed value $t_{.05} = 1.83$ is the value in the row df = 9 
and the column p = .95, since $t_{.05} = 1.83$ is the standard deviation score above which (as well 
as below which in the case of t = -1.83) a proportion equivalent to .05 of the cases in the 
distribution falls. The tabled critical one-tailed value $t_{.01} = 2.82$ is the value in the row df = 9 
and the column p = .99, since $t_{.01} = 2.82$ is the standard deviation score above which (as well 
as below which in the case of t = -2.82) a proportion equivalent to .01 of the cases in the 
distribution falls. 

The following guidelines are employed in evaluating the null hypothesis for the 
single-sample t test. 

a) If the alternative hypothesis employed is nondirectional, the null hypothesis can be re- 
jected if the obtained absolute value of t is equal to or greater than the tabled critical two-tailed 
value at the prespecified level of significance. 

b) If the alternative hypothesis employed is directional and predicts a population mean 
larger than the value stated in the null hypothesis, the null hypothesis can be rejected if the sign 
of t is positive and the value of t is equal to or greater than the tabled critical one-tailed value at 
the prespecified level of significance. 

c) If the alternative hypothesis employed is directional and predicts a population mean 
smaller than the value stated in the null hypothesis, the null hypothesis can be rejected if the sign 
of t is negative and the absolute value of t is equal to or greater than the tabled critical one-tailed 
value at the prespecified level of significance. 

Employing the above guidelines, we can only reject the null hypothesis if the directional 
alternative hypothesis $H_1: \mu > 5$ is employed, and the null hypothesis can only be rejected at 
the .05 level. This is the case since the obtained value of t is a positive number which is greater 
than the tabled critical one-tailed .05 value $t_{.05} = 1.83$. Note that the alternative hypothesis 
$H_1: \mu > 5$ is not supported at the .01 level, since the obtained value t = 1.94 is less than the 
tabled critical one-tailed .01 value $t_{.01} = 2.82$. 

The nondirectional alternative hypothesis $H_1: \mu \neq 5$ is not supported since the obtained 
value t = 1.94 is less than the tabled critical two-tailed .05 value $t_{.05} = 2.26$. 

The directional alternative hypothesis $H_1: \mu < 5$ is not supported since the obtained value 
t = 1.94 is a positive number. In order for the alternative hypothesis $H_1: \mu < 5$ to be supported, 
the computed value of t must be a negative number (and, in addition, the absolute value 
of t must be equal to or greater than the tabled critical one-tailed value at the prespecified level 
of significance). 
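
Guidelines a) through c) can be expressed as a short decision function. The Python sketch below is illustrative only: the function name `reject_null` and the labels "two-sided", "greater", and "less" are assumptions introduced here, and the critical values are the tabled values for df = 9 cited in this section.

```python
def reject_null(t, alternative, critical_two_tailed, critical_one_tailed):
    """Apply guidelines a) to c); alternative is 'two-sided', 'greater', or 'less'."""
    if alternative == "two-sided":      # guideline a): nondirectional
        return abs(t) >= critical_two_tailed
    if alternative == "greater":        # guideline b): predicts a larger mean
        return t > 0 and t >= critical_one_tailed
    if alternative == "less":           # guideline c): predicts a smaller mean
        return t < 0 and abs(t) >= critical_one_tailed
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

# Tabled critical values for df = 9: (two-tailed, one-tailed) at each level
critical = {0.05: (2.26, 1.83), 0.01: (3.25, 2.82)}
t = 1.94   # obtained value for Example 2.1

for alpha, (two_tailed, one_tailed) in critical.items():
    for alternative in ("two-sided", "greater", "less"):
        decision = reject_null(t, alternative, two_tailed, one_tailed)
        print(f"alpha = {alpha}, {alternative}: reject = {decision}")
```

Only the combination of the directional alternative hypothesis predicting a larger mean and the .05 level yields a rejection, which matches the conclusion reached above.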

In Section IV it is noted that the estimated standard error of the population mean ($s_{\bar{X}}$) 
computed for the single-sample t test represents a standard deviation of a sampling distribution 
of means. The use of the t distribution for Example 2.1 is based on the fact that when the 
population standard deviation is unknown, the latter distribution provides a better approximation of the 
underlying sampling distribution than does the normal distribution. Figure 2.1 depicts the 
sampling distribution employed for Example 2.1. This sampling distribution is interpreted in the same 
manner as the sampling distribution for the single-sample z test which is depicted in Figure 1.1. 

Inspection of the sampling distribution depicted in Figure 2.1 reveals that the obtained value 
t = 1.94 falls to the right of the tabled critical one-tailed value $t_{.05} = 1.83$. At this point it should 
be noted that $t_{.05} = 1.83$ is greater than $z_{.05} = 1.65$, the tabled critical one-tailed .05 value 
employed with the normal distribution. Both of these values demarcate the upper 5% of their 
respective distributions. If one elects to employ the single-sample z test, as opposed to the 
single-sample t test, for an analysis in which the value of $\sigma$ is unknown, it will inflate the likelihood 
of committing a Type I error. This is the case since, except when the sample size is very large 
(in which case the corresponding values of t and z are identical), a tabled critical z value at a 
prespecified level of significance will always be smaller than the corresponding tabled critical 
t value at that same level of significance. Thus, in the case of Example 2.1, if we employ the 
tabled critical one-tailed value $z_{.05} = 1.65$, the likelihood of committing a Type I error will be 
greater than .05. This can be confirmed by inspection of Table A2 which indicates that for df 
= 9, the proportion of cases in the t distribution that falls at or above the value of t = 1.65 is 
greater than .05 (i.e., a t score of 1.65 falls below the 95th percentile of the distribution). 
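
The claim that more than 5% of the t distribution for df = 9 falls at or above t = 1.65 can be checked numerically. The Python sketch below is an illustration rather than a replacement for Table A2: the function names `t_pdf` and `t_upper_tail` are assumptions introduced here, and the tail area is approximated by integrating the density of the t distribution with Simpson's rule.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate proportion of the t distribution at or above t (for t >= 0).

    Integrates the density from 0 to t with Simpson's rule (steps must be even)
    and subtracts the result from .5, the area above the mean.
    """
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

df = 9
for value in (1.65, 1.83):
    print(f"proportion at or above t = {value}: {t_upper_tail(value, df):.4f}")
```

The proportion obtained for t = 1.65 exceeds .05, whereas the proportion for the tabled critical value t = 1.83 is approximately .05.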

Figure 2.1 Sampling Distribution for Example 2.1 


A summary of the analysis of Example 2.1 with the single-sample t test follows: With 
respect to the average number of times the doctor sees a patient, we can conclude that the sample 
of 10 subjects comes from a population with a mean value other than 5 only if we employ the 
directional alternative hypothesis $H_1: \mu > 5$, and prespecify as our level of significance $\alpha = 
.05$. This result can be summarized as follows: t(9) = 1.94, p < .05. (The degrees of freedom 
employed in the analysis are noted in parentheses after the t.) 


VI. Additional Analytical Procedures for the Single-Sample t Test 
and/or Related Tests 


1. Determination of the power of the single-sample t test and the single-sample z test, and 
the application of Test 2a: Cohen's d index The power of either the single-sample z test or 
the single-sample t test will represent the probability of the test identifying a difference between 
the value for the population mean stipulated in the null hypothesis and a specific value that 
represents the true value of the mean of the population represented by the experimental sample. 
In order to compute the power of a test, it is necessary for the researcher to stipulate the latter 
value which will be identified with the notation $\mu_1$. In practice, a researcher can compute the 
power of a test for any value of $\mu_1$. 

The power of the test will be a function of the difference between the value of $\mu$ stated in 
the null hypothesis and the value of $\mu_1$. The test's power will increase as the absolute value of 
the difference between the values $\mu$ and $\mu_1$ increases. This is the case, since if the sample is 
derived from a population with a mean value that is substantially above or below the value of $\mu$ 
stated in the null hypothesis, it is likely that this would be reflected in the value of the sample 
mean ($\bar{X}$). Obviously, the more the value of $\bar{X}$ deviates from the hypothesized value of $\mu$, the 
greater the absolute value of the numerators (i.e., $\bar{X} - \mu$) in Equations 1.3 and 2.3 (which are, 
respectively, the equations for the single-sample z test and the single-sample t test). Assuming 
that the value of the denominator is held constant, the larger the value of the numerator, the larger 
the absolute value of the computed test statistic (i.e., z or t). The larger the latter value the more 
likely it is that the researcher will be able to reject the null hypothesis (assuming the obtained 
difference is in the direction predicted in the alternative hypothesis), and consequently the more 
powerful the test. 

Since the obtained value of the test statistic is also a function of the denominator of 
Equations 1.3 and 2.3 (i.e., the actual or estimated standard error of the population mean), the latter 
value also influences the power of the test. Specifically, as the value of the denominator 
decreases, the computed absolute value of the test statistic increases. It happens to be the case that 
the value of the standard error of the mean is a function of the population standard deviation 
(which is estimated in the case of the t test) and the sample size. Inspection of Equations 1.2 and 
2.2 (which yield the denominators of Equations 1.3 and 2.3) reveals that the standard error of the 
mean will decrease if the value of the standard deviation is decreased or the sample size is 
increased. Thus, by employing an accurate estimate of the 
population standard deviation (more specifically, in the case of the t test, one that is not 
spuriously inflated due to sampling error) and a large sample size, one can minimize the value 
of the denominator in Equations 1.3 and 2.3, and consequently maximize the absolute value of 
the test statistic. As a result, one can increase the likelihood that the null hypothesis will be 
rejected, which increases the power of the test. 

The power of a statistical test can be represented both mathematically and graphically. 
Figures 2.2 and 2.3 illustrate the concept of power and its relationship to the Type I and Type II 
error rates. Both figures contain two distributions which represent the sampling distributions of 
means for two populations.⁴ In each figure, the sampling distribution on the left represents a 
population with the mean $\mu$ (i.e., the value stated in the null hypothesis). The sampling 
distribution on the right represents a population with the mean $\mu_1$, which we will assume is the true 
value of the mean of the population from which the experimental sample is derived. Figures 2.2 
and 2.3 both assume a fixed value for the sample size upon which the sampling distributions are 
based, and that each of the underlying populations represented by the sampling distributions has 
the same standard deviation. Figure 2.2 represents a case in which there is a large difference 
between the values of $\mu_1$ and $\mu$, whereas Figure 2.3 represents a case in which there is a small 
difference between the two values. When expressed in standard deviation units, the magnitude 
of the absolute value of the difference between $\mu_1$ and $\mu$ is referred to as the effect size. Thus, 
in Figure 2.2 a large effect size is present, whereas Figure 2.3 depicts a small effect size. 

The reader should note the following with respect to Figures 2.2 and 2.3. 

a) The closer the values $\mu$ and $\mu_1$ are to one another, the more the sampling distributions 
of the two populations overlap. 

b) In the case of a one-tailed analysis, the value of alpha ($\alpha$) is represented by area (///) in 
the distribution on the left. Recollect that $\alpha$ represents the likelihood of committing a Type I 
error (i.e., rejecting a true null hypothesis). Numerically, $\alpha$ represents the proportion of the left 
distribution that comprises area (///). In the case of a two-tailed analysis, the proportion of the 
left distribution represented by area (///) will be equal to $\alpha/2$. 

c) The value of beta ($\beta$) is represented by area (=) in the distribution on the right. $\beta$ 
represents the likelihood of committing a Type II error (i.e., retaining a false null hypothesis). 
Numerically, $\beta$ represents the proportion of the right distribution that comprises area (=). 

d) The power of the test is represented by area (\\\) in the right distribution. Note that this 
is the area in the right distribution that falls to the right of the area delineating $\beta$. Numerically, 
the power of the test represents the proportion of the right distribution that comprises area (\\\). 
The power of the test can also be represented by subtracting the value of $\beta$ from 1 (i.e., Power 
= 1 - $\beta$). Note that the area in the left distribution that represents $\alpha$ overlaps the area in the right 
distribution representing the power of the test. 

e) In order to increase the value of $\alpha$, one must move the boundary in the left distribution that 
delineates $\alpha$ to the left. By doing the latter, one will decrease the value of $\beta$, since the area in the 
right distribution that corresponds to $\beta$ will decrease. By increasing the value of $\alpha$, one also 
increases the area in the right distribution which represents the power of the test. This illustrates 
the fact that if one increases the likelihood of committing a Type I error ($\alpha$), one decreases the 
likelihood of committing a Type II error ($\beta$), and at the same time increases the power of the test 
(1 - $\beta$). In the same respect, to decrease the value of $\alpha$ one must move the boundary in the left 
distribution that delineates alpha to the right. By doing the latter, one will increase the value of $\beta$, 
since the area in the right distribution that corresponds to $\beta$ will increase. By decreasing the 
value of $\alpha$, one also decreases the area in the right distribution which represents the power of the 
test. This illustrates the fact that if one decreases the likelihood of committing a Type I error, the 
likelihood of committing a Type II error increases, and at the same time the power of the test 
decreases. 


Figure 2.3 Sampling Distributions Employed in Determining the Power 
of a Test Involving a Small Effect Size 

f) It should also be noted that at a given level of significance, a one-tailed test will be more 
powerful than a two-tailed test. This is the case since with a one-tailed test the point that 
delineates $\alpha$ will be farther to the left in the left distribution than will be the case with a two-tailed 
test. As an example, when $\alpha = .05$, the tabled critical one-tailed value for z is $z_{.05} = 1.65$, 
whereas the tabled critical two-tailed value is $z_{.05} = 1.96$. Since both of these critical values are 
in the left distribution, the former value will be farther to the left, thus expanding the area in the 
right distribution which represents the power of the test. 
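
The relationships described in points a) through f) can be illustrated numerically for the single-sample z test. The Python sketch below is a minimal illustration under stated assumptions: the function name `z_test_power`, the population standard deviation of 4, and the sample size of 30 are hypothetical values introduced here, and the true mean is assumed to lie above the value stated in the null hypothesis. It relies on `NormalDist` from the Python standard library.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()   # standard normal distribution (mean 0, standard deviation 1)

def z_test_power(mu0, mu1, sigma, n, alpha, two_tailed):
    """Power of the single-sample z test when the true mean mu1 exceeds mu0."""
    # Distance between the two sampling distributions in standard error units
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))
    if two_tailed:
        z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
        # Area of the right distribution beyond either critical boundary
        return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_crit - shift)

# Hypothetical values chosen only for illustration
mu0, mu1, sigma, n = 5, 6, 4, 30
for alpha in (0.05, 0.01):
    for two_tailed in (False, True):
        power = z_test_power(mu0, mu1, sigma, n, alpha, two_tailed)
        label = "two-tailed" if two_tailed else "one-tailed"
        print(f"alpha = {alpha:.2f}, {label}: power = {power:.3f}, beta = {1 - power:.3f}")
```

Under these assumptions, lowering alpha from .05 to .01 reduces power (and raises beta), and at either level the one-tailed test is more powerful than the two-tailed test.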

Two methods will now be demonstrated which can be used to determine the power of either
the single-sample t test or the single-sample z test. The first method (which is more time
consuming) reveals all of the logical operations involved in computing the power of a test. The
second method, which employs Table A3 (Power Curves for Student's t Distribution) in the
Appendix, requires fewer computations. It should be emphasized that whenever possible, a
power analysis should be conducted prior to the data collection phase of a study. By computing
power beforehand, one is able to design a study with a sample size that is large enough to detect
the specific effect size predicted by the researcher.


Method 1 for computing the power of the single-sample t test and the single-sample z test
The first method will initially be demonstrated with reference to the single-sample z test. The
reason for employing the latter test is that it will allow us to use Table A1 (Table of the Normal
Distribution) in the Appendix, which lists probabilities for all z values between 0 and 4.
Detailed tables of the t distribution that list probabilities for all t values within this range are
generally not available.

Let us assume that in our example we are employing the same null hypothesis that is
employed in Example 2.1 (i.e., $H_0: \mu = 5$). It will be assumed that the researcher wishes to
evaluate the power of the single-sample z test in reference to the alternative hypothesis
$H_1: \mu_1 = 6$. Note that in conducting a power analysis, the alternative hypothesis states that
the population mean is a specific value that is different from the value stated in $H_0$. In
conducting the power analysis, it will be assumed that the null hypothesis will be evaluated with
a two-tailed test, with $\alpha = .05$. For purposes of illustration, it will also be assumed that the
researcher evaluates the null hypothesis employing a sample size of n = 121. In addition, we will
assume that the value of the population standard deviation is known to be $\sigma = 4.25$.

Employing Equation 1.2, the value of the standard error of the population mean is
computed. Thus: $\sigma_{\bar{X}} = 4.25/\sqrt{121} = .39$.

Figure 2.4, which depicts the analysis graphically, consists of two overlapping normal
distributions. Each distribution is a sampling distribution of population means. The distribution
on the left, which is the sampling distribution of means of a population with a mean of 5, will be
referred to as Distribution A. The distribution on the right, which is the sampling distribution
of means of a population with a mean of 6, will be referred to as Distribution B. We have
already determined above that $\sigma_{\bar{X}} = .39$, and we will assume that this value represents the
standard deviation of each of the sampling distributions. The area (///) delineates the proportion
of Distribution A that corresponds to the value $\alpha/2$, which equals .025. This is the case, since
$\alpha = .05$ and a two-tailed test is being used. In such an instance, the proportion of the curve
comprising the critical area in each of the tails of Distribution A will be .05/2 = .025. Area (=)
delineates the proportion of Distribution B that corresponds to the probability of committing a
Type II error ($\beta$). Area (\\\) delineates the proportion of Distribution B that represents the power
of the test.

The procedure for computing the proportions in Figure 2.4 will now be described. The first
step in computing the power of the test requires one to determine how far above the value
$\mu = 5$ the sample mean will have to be in order to reject the null hypothesis. Equation 1.3 is
employed to determine this minimum required difference. By algebraically transposing the terms
in Equation 1.3 we can determine that $\bar{X} - \mu = (z_{.05})(\sigma_{\bar{X}})$. Thus, by substituting the values
$z_{.05} = 1.96$ (which is the tabled critical two-tailed .05 z value) and $\sigma_{\bar{X}} = .39$ in the latter
equation we can compute that the minimum required difference is $\bar{X} - \mu = (1.96)(.39) = .76$.

Thus, any sample mean .76 units above or below the value $\mu = 5$ will allow the researcher
to reject the null hypothesis at the .05 level (if a two-tailed analysis is employed). With respect
to evaluating the power of the test in reference to the alternative hypothesis $H_1: \mu_1 = 6$, the
researcher is only concerned with a mean value above 5 (which will fall in the right tail of
Distribution A). Thus, a mean value of $\bar{X} = 5.76$ or greater will allow the researcher to reject
the null hypothesis (since $\mu + (z_{.05})(\sigma_{\bar{X}}) = 5 + .76 = 5.76$).

Figure 2.4 Visual Representation of Power when $H_0: \mu = 5$ and $H_1: \mu_1 = 6$ for n = 121


The next step in the analysis requires that the area in Distribution B that falls between the
mean $\mu_1 = 6$ and the value 5.76 be computed. This is accomplished by employing Equation 1.3
and substituting 5.76 to represent the value of $\bar{X}$ and $\mu_1 = 6$ to represent the value of $\mu$.

$$z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{5.76 - 6}{.39} = -.62$$

Utilizing Table A1, we determine that the proportion of Distribution B that lies between $\mu_1 = 6$
and 5.76 (i.e., between the mean and a z score of -.62 or +.62) is .2324. Since the value 5.76 is
below the mean of Distribution B, if .5 (which is the proportion of Distribution B that falls above
the mean $\mu_1 = 6$) is added to .2324, the resulting value of .7324 will represent the power of the
test. This latter value is represented by area (\\\) in Figure 2.4. The likelihood of committing a
Type II error (i.e., $\beta$) is represented by area (=). The proportion of Distribution B that constitutes
this latter area is determined by subtracting the value .7324 from 1. Thus: $\beta = 1 - .7324 = .2676$.
Based on the results of the power analysis we can state that if the alternative hypothesis
$H_1: \mu_1 = 6$ is true, the likelihood that the null hypothesis will be rejected is .7324, and, at the
same time, there is a .2676 likelihood that it will be retained. If the researcher considered the
computed value for the power too low (which we are assuming is determined prior to
implementing the study), she can increase the power of the test by employing a larger sample
size.
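
The steps of Method 1 for the single-sample z test can be sketched in code. This is an
illustration under stated assumptions, not the handbook's own procedure: it uses Python's
standard-library statistics.NormalDist in place of Table A1, the function name z_test_power is
hypothetical, and no intermediate value is rounded. For that reason it returns a power of roughly
.735 rather than the hand-computed .7324, and it also adds the negligible area of Distribution B
that falls in the lower rejection region.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a two-tailed single-sample z test against a specific alternative."""
    std_normal = NormalDist()
    se = sigma / sqrt(n)                         # standard error of the mean
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # tabled as 1.96 when alpha = .05
    upper_cut = mu0 + z_crit * se                # reject H0 at or above this mean
    lower_cut = mu0 - z_crit * se                # reject H0 at or below this mean
    # Areas of Distribution B (centered on mu1) lying in the two rejection regions.
    power_upper = 1 - std_normal.cdf((upper_cut - mu1) / se)
    power_lower = std_normal.cdf((lower_cut - mu1) / se)
    return power_upper + power_lower

power = z_test_power(mu0=5, mu1=6, sigma=4.25, n=121)
beta = 1 - power   # likelihood of a Type II error
```

Calling the function with a larger n shows directly how a larger sample size raises the power.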

If the value of $\sigma$ is not known and has to be estimated from the sample data, the power
analysis will be based on the t distribution instead of the normal distribution. In such a case the
identical protocol described above for computing power is employed, except for the fact that a
tabled critical t value is used in place of the tabled critical z value. Unless the sample size is
extremely large, the tabled critical t value will be larger than the tabled critical z value used for
the same data. As a result of this, the power of the test computed for the t distribution will be
lower than the value computed for the normal distribution.

In the case of the example under discussion, if the t distribution is employed one would
use the tabled critical two-tailed .05 t value for df = n - 1 = 121 - 1 = 120. From Table A2 it can
be determined that this value is $t_{.05} = 1.98$. Using the latter value and the value $s_{\bar{X}} = .39$, it can
be determined that .77 is the minimum required difference in order to achieve significance. When
.77 is added to the value $\mu = 5$, it indicates that a sample mean of 5.77 or greater (as well as 4.23
or lower) will allow the researcher to reject the null hypothesis. The t value required to complete
the power calculations is determined by utilizing Equation 2.3 and substituting 5.77 to represent
the value of $\bar{X}$ and $\mu_1 = 6$ to represent the value of $\mu$. The calculations are noted below.

$$(t_{.05})(s_{\bar{X}}) = \bar{X} - \mu \qquad (1.98)(.39) = .77$$

$$t = \frac{5.77 - 6}{.39} = -.59$$

Detailed tables of the t distribution indicate that for df = 120 the proportion of cases
between the mean and a t score of -.59 or +.59 is approximately .22. The power of the test is
derived by adding .5 to the latter value. Thus, Power = .22 + .5 = .72, which is slightly lower
than the value .7324 obtained for the normal distribution.

It was previously noted that the size of the sample employed in a study is directly related
to the power of a statistical test. Thus, in the example under discussion, if instead of using a
sample size of n = 121, we employ a sample size of n = 10, the power of both the single-sample
z test and the single-sample t test will be considerably less than the computed values .7324 and
.72. In point of fact, when n = 10 the power of the single-sample z test equals .1112. The
dramatic decrease in power for the small sample size can be understood by determining the
minimum amount by which $\bar{X}$ and $\mu$ must differ from one another in order to reject the null
hypothesis. Since we are still employing the normal distribution, when n = 10 the same tabled
critical two-tailed value $z_{.05} = 1.96$ is used. However, the value of $\sigma_{\bar{X}}$ is increased substantially,
since $\sigma_{\bar{X}} = 4.25/\sqrt{10} = 1.34$. Employing the values $z_{.05} = 1.96$ and $\sigma_{\bar{X}} = 1.34$, we can
compute that the minimum required difference in order to reject $H_0$ when n = 10 is 2.63.
Specifically: $\bar{X} - \mu = (1.96)(1.34) = 2.63$.

A sample mean that is 2.63 units above $\mu = 5$ is equal to 7.63. The latter value falls much
farther to the right in Distribution B than the value 5.76 which is computed when
n = 121. Substituting the values $\bar{X} = 7.63$ and $\mu = 6$ in Equation 1.3, we determine that
$\bar{X} = 7.63$ is 1.22 standard deviation units above the mean of Distribution B.

$$z = \frac{7.63 - 6}{1.34} = 1.22$$

If one examines Figure 2.5, which depicts the analysis graphically, it can be seen that in
Distribution B the value z = 1.22 lies to the right of the mean of the distribution. Thus, when
n = 10 the power of the single-sample z test will be represented by the proportion of Distribution
B that comprises area (\\\). Employing the table for the normal distribution, it can be determined
that the proportion of the curve to the right of z = 1.22 is .1112, which represents the power of
the test. On the basis of this, we can determine that the likelihood of committing a Type II error
(represented by area (=)) will be $\beta = 1 - .1112 = .8888$, which is substantially greater than
the value .2676 obtained when n = 121. Note that area (\\\) in Distribution B is much smaller
than the corresponding area depicted in Figure 2.4 (when n = 121). By virtue of area (\\\) being
smaller, the proportion of Distribution B in Figure 2.5 representing area (=) is substantially larger
than the corresponding proportion/area in Figure 2.4.
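
The effect of sample size on the standard error, the minimum required difference, and the power
can be compared side by side. The sketch below is illustrative only: it assumes Python's
standard-library statistics.NormalDist, the function name min_difference_and_power is
hypothetical, and, like the hand calculation, it considers only the upper rejection region. Since
intermediate values are not rounded, the powers it produces differ slightly in the third decimal
place from the table-based values.

```python
from math import sqrt
from statistics import NormalDist

def min_difference_and_power(mu0, mu1, sigma, n, alpha=0.05):
    """Return (standard error, minimum required difference, upper-tail power)."""
    std_normal = NormalDist()
    se = sigma / sqrt(n)
    min_diff = std_normal.inv_cdf(1 - alpha / 2) * se
    # Proportion of Distribution B at or above the critical sample mean.
    power = 1 - std_normal.cdf((mu0 + min_diff - mu1) / se)
    return se, min_diff, power

results = {n: min_difference_and_power(5, 6, 4.25, n) for n in (10, 121)}
for n, (se, min_diff, power) in results.items():
    print(f"n={n:>3}  se={se:.2f}  min diff={min_diff:.2f}  power={power:.3f}")
```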

Figure 2.5 Visual Representation of Power when $H_0: \mu = 5$ and $H_1: \mu_1 = 6$ for n = 10

The t distribution will now be applied to the above problem. Let us assume that n = 10 and
the value of $\sigma$ is unknown. The latter value, however, is estimated from the sample data to be
$\tilde{s} = 4.25$. The aforementioned values n = 10 and $\tilde{s} = 4.25$ correspond to those employed in
Example 2.1. If one wants to compute the power of the single-sample t test for Example 2.1
with reference to the alternative hypothesis $H_1: \mu_1 = 6$, the same protocol as described above
for the normal distribution is employed, except for the fact that the value $t_{.05} = 2.26$ (which is
the tabled critical two-tailed .05 t value for df = 9) is used in the analysis. Thus:

$$(t_{.05})(s_{\bar{X}}) = \bar{X} - \mu \qquad (2.26)(1.34) = 3.03$$

$$t = \frac{8.03 - 6}{1.34} = 1.51$$

The use of the value $\bar{X} = 8.03$ in the t test equation above is predicated on the fact that a
mean of 8.03 is 3.03 units above the value $\mu = 5$ stated in the null hypothesis. Detailed tables of
the t distribution indicate that for df = 9, the proportion of cases that falls above a t score of 1.51
is approximately .085 (which corresponds to area (\\\) in Figure 2.5 if the latter represented the
t distribution). The value .085 represents the power of the test. The likelihood of committing
a Type II error (which corresponds to area (=)) is $\beta = 1 - .085 = .915$.
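
When detailed tables of the t distribution are not at hand, the tail proportion used above can
be approximated numerically. The sketch below is an assumption-laden illustration, not the
handbook's method: it integrates the density of Student's t distribution with Simpson's rule, the
function names t_pdf and t_upper_tail are hypothetical, and it follows the text in treating
Distribution B as an ordinary (central) t distribution. It yields a proportion close to, but not
identical with, the approximate value .085 cited above.

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the interval [0, t]."""
    if t < 0:
        raise ValueError("this sketch only handles t >= 0")
    if steps % 2:
        steps += 1                        # Simpson's rule needs an even count
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t              # the distribution is symmetric about 0

se = 1.34                                 # estimated standard error from the text
t_crit = 2.26                             # tabled two-tailed .05 value for df = 9
t_value = ((5 + t_crit * se) - 6) / se    # distance of the cutoff from mu1 = 6
power = t_upper_tail(t_value, df=9)
```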

A comparison of the values obtained for the power of the single-sample z test and the
single-sample t test for the two sample sizes employed in the discussion of power (i.e., for
n = 121 and n = 10), reveals that when the values of both n and the standard deviation are fixed,
the single-sample z test provides a more powerful test of an alternative hypothesis than does the
single-sample t test (keeping in mind, however, that the use of the single-sample z test is
justified only if the value of $\sigma$ is known).


Test 2a: Cohen's d index (Method 2 for computing the power of the single-sample t test and
the single-sample z test)
It was noted previously that when the magnitude of the absolute value
of the difference between $\mu_1$ and $\mu$ is expressed in standard deviation units, the resulting value
is referred to as the effect size. The computation of effect size, represented by the notation d, can
be summarized by Equation 2.5. Throughout the book, the d statistic computed with
Equation 2.5 will be referred to as Cohen's d index, since it was first employed as a measure of
effect size by Jacob Cohen (1977, 1988).

$$d = \frac{|\mu_1 - \mu|}{\sigma} \qquad \text{(Equation 2.5)}$$


In the above equation, in the case of the single-sample z test the value of $\sigma$ will be known,
whereas in the case of the single-sample t test the latter value will have to be estimated (either
from the sample data or from prior research). Cohen (1977; 1988, pp. 24-27) has proposed the
following (admittedly arbitrary) d values as criteria for identifying the magnitude of an effect
size: a) A small effect size is one that is greater than .2 but not more than .5 standard deviation
units; b) A medium effect size is one that is greater than .5 but not more than .8 standard
deviation units; c) A large effect size is greater than .8 standard deviation units.

Note that in Equation 2.5 the effect size is based on population parameters and does not take
into account the size of the sample. Since the power of a test is also a function of sample size,
it is necessary to convert the value of d into a measure that takes into account both the population
parameters and the sample size. This measure, represented by the notation $\delta$ (which is the lower
case Greek letter delta), is referred to as the noncentrality parameter. The value of $\delta$ is
computed with Equation 2.6.

$$\delta = d\sqrt{n} \qquad \text{(Equation 2.6)}$$
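
Equations 2.5 and 2.6 translate directly into code. The following is a minimal sketch; the
function names cohens_d and noncentrality are hypothetical, and the input values are those of
the example used to demonstrate Method 1 (means of 5 and 6, a standard deviation of 4.25, and
sample sizes of 121 and 10).

```python
from math import sqrt

def cohens_d(mu1, mu0, sigma):
    """Equation 2.5: absolute mean difference in standard deviation units."""
    return abs(mu1 - mu0) / sigma

def noncentrality(d, n):
    """Equation 2.6: delta equals d times the square root of n."""
    return d * sqrt(n)

d = cohens_d(mu1=6, mu0=5, sigma=4.25)
delta_121 = noncentrality(d, 121)
delta_10 = noncentrality(d, 10)
```

Because d is not rounded to three decimal places before it is multiplied, the resulting values
of delta can differ from hand-computed values in the third decimal place.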


If the value of $\delta$ is computed for a specific sample size, the power of both the single-sample
z test and the single-sample t test can be determined by using Table A3 in the Appendix, which
consists of four sets of power curves in which the value of $\delta$ is plotted in reference to the power
of a test. Each set of power curves is based on a different level of significance. Specifically:
Table A3-A is employed for either a two-tailed analysis with $\alpha = .01$ or a one-tailed analysis with
$\alpha = .005$. Table A3-B is employed for either a two-tailed analysis with $\alpha = .02$ or a one-tailed
analysis with $\alpha = .01$. Table A3-C is employed for either a two-tailed analysis with $\alpha = .05$ or
a one-tailed analysis with $\alpha = .025$. Table A3-D is employed for either a two-tailed analysis with
$\alpha = .10$ or a one-tailed analysis with $\alpha = .05$. Note that each set of power curves is comprised of
either eight or ten curves, each of which represents a different degrees of freedom value. When
the degrees of freedom computed for an experiment do not equal one of the df values represented
by the curves, the researcher must interpolate to approximate the power of the test. Regardless
of the sample size, the curve for $df = \infty$ should always be used in determining the power of the
single-sample z test. The latter curve is also used for the single-sample t test for large sample
sizes.

The protocol for employing the curves in Table A3 is as follows: a) Compute the value of
$\delta$; b) Upon locating $\delta$ on the horizontal axis of the appropriate set of curves, draw a line that is
perpendicular to the axis which intersects the curve that represents the appropriate df value; and
c) At the point the line intersects the curve, drop a perpendicular to the vertical axis on which
power values are noted. The point at which the latter line intersects the vertical axis will indicate
the power of the test.

The noncentrality parameter will now be employed to compute the power of the
single-sample z test and the single-sample t test using the same data employed to demonstrate
Method 1. Thus, the power analysis will assume that the null hypothesis will be evaluated with a
two-tailed test, with $\alpha = .05$. In addition:

$$H_0: \mu = 5 \qquad H_1: \mu_1 = 6 \qquad \sigma = \tilde{s} = 4.25 \qquad n = 121$$

Employing Equation 2.5, the value d = .235 is computed.

$$d = \frac{|6 - 5|}{4.25} = .235$$

Note that using Cohen's (1977, 1988) criteria for effect size, the value d = .235 indicates that we
are attempting to detect a small effect size.
For n = 121, Equation 2.6 is employed to calculate the value $\delta = 2.59$.

$$\delta = (.235)\sqrt{121} = 2.59$$


Employing the power curve for $df = \infty$ in Table A3-C, the power of the single-sample z test
is determined to be approximately .73. Since there is no curve for df = 120, the power of the
single-sample t test is based on a curve that falls between the $df = \infty$ and df = 24 power curves.
Through interpolation, the power of the single-sample t test is determined to be approximately
.72. Note that these are the same values that are computed with Method 1.

For the same example, with n = 10, $\delta = (.235)\sqrt{10} = .743$. Employing the power curve
for $df = \infty$, the power of the single-sample z test is determined to be approximately .11. Since
df = 9, using the df = 6 and df = 12 power curves as reference points, we determine that the power
of the single-sample t test is approximately .085. These values are consistent with those
computed with Method 1.
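
For the single-sample z test, the reading taken from the power curve for an infinite number of
degrees of freedom can be approximated from the normal distribution. The sketch below is
offered under stated assumptions: it uses Python's standard-library statistics.NormalDist, the
function name power_from_delta is hypothetical, and for a two-tailed test it adds the small
probability of rejecting in the opposite tail, so its results are only expected to agree with the
curve readings to about two decimal places. It does not reproduce the curves for finite degrees
of freedom.

```python
from statistics import NormalDist

def power_from_delta(delta, alpha=0.05, tails=2):
    """Approximate power of a z test for a given noncentrality parameter."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / tails)
    power = 1 - std_normal.cdf(z_crit - delta)
    if tails == 2:
        power += std_normal.cdf(-z_crit - delta)   # rejection in the opposite tail
    return power

power_n121 = power_from_delta(2.59)    # delta for n = 121
power_n10 = power_from_delta(0.743)    # delta for n = 10
```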

It should be emphasized again that, whenever possible, prior to the data collection phase
of a study, a researcher should stipulate the minimum effect size that she is attempting to detect.
The smaller the effect size, the larger the sample size that will be required in order to have a test
of sufficient power that will allow one to reject a false null hypothesis. As long as a researcher
knows or is able to estimate (from the sample data) the population standard deviation, by
employing trial and error one can substitute various values of n in Equation 2.6 until the computed
value of $\delta$ corresponds to the desired value for the power of the test. Power tables developed by
Cohen (1977, 1988) are commonly employed within the framework of the present discussion as a
quick means of determining the minimum sample size necessary to achieve a specific level of power
in reference to a specific effect size.
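
The trial and error search over values of n can be automated. The sketch below is illustrative
only: it applies to the single-sample z test, relies on Python's standard-library
statistics.NormalDist, and uses the hypothetical function names z_power and minimum_n. The
target power of .80 is an assumption made for the sake of the example and is not a value
stipulated in the present discussion.

```python
from math import sqrt
from statistics import NormalDist

def z_power(d, n, alpha=0.05):
    """Two-tailed power of the single-sample z test for effect size d."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    delta = d * sqrt(n)                               # Equation 2.6
    return (1 - std_normal.cdf(z_crit - delta)) + std_normal.cdf(-z_crit - delta)

def minimum_n(d, target_power=0.80, alpha=0.05, n_max=100_000):
    """Smallest n (searched upward from 2) whose power reaches the target."""
    for n in range(2, n_max + 1):
        if z_power(d, n, alpha) >= target_power:
            return n
    raise ValueError("target power not reached within n_max")

d = 1 / 4.25              # effect size for the example (means of 5 and 6)
n_needed = minimum_n(d)   # assumed target power of .80
```

A smaller effect size passed to minimum_n returns a larger required sample size, in line with the
relationship described above.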

In closing the discussion of power, it should be noted that if a researcher employs a large
enough sample size, a significant difference can be obtained almost 100% of the time. Over the
years various researchers have pointed out that the value of a sample mean is rarely if ever equal
to the value of $\mu$ stated in the null hypothesis; in other words, that the null hypothesis is rarely
if ever true. Obviously a researcher must discern whether or not a statistically significant
difference that reflects a minimal effect size is of any practical or theoretical significance. In
instances where it is not, for all practical purposes, if one rejects the null hypothesis under such
circumstances, one is committing a Type I error. Criticisms that have been directed toward the
conventional hypothesis testing model (i.e., the model that rejects the null hypothesis of zero
difference when a result is statistically significant) are addressed in the discussion of
meta-analysis and related topics, which can be found in Section IX (the Addendum) of the Pearson
product-moment correlation coefficient (Test 28).


2. Computation of a confidence interval for the mean of the population represented by a sample
The hypothesis testing procedure described for the single-sample t test and the single-sample
z test merely allows the researcher to determine whether or not it is reasonable to
conclude that the mean of a population is equal to a specific value. Interval estimation, another
methodology used in inferential statistics (which is discussed briefly in the Introduction), allows
a researcher to specify a range of values within which she can be confident the true value of a
population parameter falls. One such interval which can be computed from sample data is
referred to as a confidence interval.

In this section the procedure for computing a confidence interval for the mean of a popu- 
lation will be described. The confidence interval for the mean is a range of values within which 
a researcher can be confident to a specified degree that the true value of the population mean 
falls. When the value of the population standard deviation is unknown, computation of a
confidence interval for a single sample involving interval/ratio data utilizes the t distribution.
The following confidence intervals are most commonly computed: a) The 95% confidence
interval stipulates the range of values within which one can be 95% confident the true population
mean falls. Stated in probabilistic terms, there is a .95 probability/likelihood that the true value
of the population mean falls within the range of values that define the 95% confidence interval;
b) The 99% confidence interval stipulates the range of values within which one can be 99%
confident the true population mean falls. Stated in probabilistic terms, there is a .99
probability/likelihood that the true value of the population mean falls within the range of values
that define the 99% confidence interval.

Equation 2.7 is the general equation for computing a confidence interval for a population
mean.

$$CI_{(1-\alpha)} = \bar{X} \pm (t_{\alpha/2})(s_{\bar{X}}) \qquad \text{(Equation 2.7)}$$

Where: $t_{\alpha/2}$ represents the tabled critical two-tailed value in the t distribution, for df = n - 1,
below which a proportion (percentage) equal to $[1 - (\alpha/2)]$ of the cases falls. If the
proportion (percentage) of the distribution that falls within the confidence interval is
subtracted from 1 (100%), it will equal the value of $\alpha$.


Equation 2.8 is employed to compute the 95% confidence interval, which will be
represented by the notation $CI_{.95}$ (since .95 is equivalent to 95%).

$$CI_{.95} = \bar{X} \pm (t_{.05})(s_{\bar{X}}) \qquad \text{(Equation 2.8)}$$

In Equation 2.8, $t_{.05}$ represents the tabled critical two-tailed .05 t value for df = n - 1. By
employing the latter critical t value, one will be able to identify the range of values within the
sampling distribution that define the middle 95% of the distribution. Only 5% of the scores in
the sampling distribution will fall outside that range. Specifically, 2.5% of the scores will fall
above the upper limit of the range, and 2.5% of the scores will fall below the lower limit of the
range.

Equation 2.9 is employed to compute the 99% confidence interval, which will be
represented by the notation $CI_{.99}$ (since .99 is equivalent to 99%).

$$CI_{.99} = \bar{X} \pm (t_{.01})(s_{\bar{X}}) \qquad \text{(Equation 2.9)}$$

Note that the only difference between Equation 2.9 and Equation 2.8 is that in Equation 2.9
the critical value $t_{.01}$ is employed. The latter value represents the tabled critical two-tailed .01
value for df = n - 1. By using the two-tailed $t_{.01}$ value, one will be able to identify the range of
values within the sampling distribution that define the middle 99% of the distribution. Only 1%
of the scores in the sampling distribution will fall outside that range. Specifically, .5% of the
scores will fall above the upper limit of the range, and .5% of the scores will fall below the lower
limit of the range.

The values $\bar{X} = 7.6$, $s_{\bar{X}} = 1.34$, and $t_{.05} = 2.26$ will now be substituted in Equation 2.8
to compute the 95% confidence interval for the mean of the population employed in Example 2.1.

$$CI_{.95} = 7.6 \pm (2.26)(1.34) = 7.6 \pm 3.03$$

The above result can be summarized as follows: $4.57 \le \mu \le 10.63$. The notation
$4.57 \le \mu \le 10.63$ means that the value of $\mu$ is greater than or equal to 4.57 and less than or
equal to 10.63. This result tells us that if a mean of $\bar{X} = 7.6$ is computed for a sample size of
n = 10, we can be 95% confident (or the probability is .95) that the true value of the mean of the
population the sample represents falls between the values 4.57 and 10.63. Thus, with respect to
Example 2.1, the physician can be 95% confident that the average number of visits per patient is
between 4.57 and 10.63.

Equation 2.9 will now be employed to compute the 99% confidence interval for the
population mean in Example 2.1.

$$CI_{.99} = 7.6 \pm (3.25)(1.34) = 7.6 \pm 4.36$$

The above result can be summarized as follows: $3.24 \le \mu \le 11.96$. This result tells us that
if a mean of $\bar{X} = 7.6$ is computed for a sample size of n = 10, we can be 99% confident (or the
probability is .99) that the true value of the mean of the population the sample represents falls
between the values 3.24 and 11.96. Thus, with respect to Example 2.1, the physician can be 99%
confident that the average number of visits per patient is between 3.24 and 11.96.
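
The two interval computations for Example 2.1 can be expressed as a short function. This is a
minimal sketch in which the name confidence_interval is hypothetical; the critical values 2.26
and 3.25 are the tabled two-tailed .05 and .01 t values for df = 9 cited in the text, supplied by
hand rather than computed. Since the margin is not rounded before it is added and subtracted, the
limits can differ from the hand-computed limits by a few thousandths.

```python
def confidence_interval(mean, std_error, critical_value):
    """Return (lower, upper) limits: mean plus or minus critical value times SE."""
    margin = critical_value * std_error
    return mean - margin, mean + margin

# Example 2.1: sample mean 7.6, estimated standard error 1.34, df = 9.
ci_95 = confidence_interval(7.6, 1.34, 2.26)   # tabled two-tailed .05 t value
ci_99 = confidence_interval(7.6, 1.34, 3.25)   # tabled two-tailed .01 t value
```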

Note that the range of values which defines the 99% confidence interval is larger than the 
range of values that defines the 95% confidence interval. This will always be the case, since it 
is only logical that by stipulating a larger range of values one will be able to have a higher degree 
of confidence that the true value of the population mean has been included within that range. It 
is also the case that the larger the sample size employed in computing a confidence interval, the 
smaller the range of values that will define the confidence interval. Figures 2.6 and 2.7 provide 
a graphical summary of the computation of the 95% and 99% confidence intervals. 

Note that in Figure 2.6 the following is true: a) 47.5% of the scores in the sampling
distribution fall between the sample mean $\bar{X} = 7.6$ and 4.57, the lower limit of $CI_{.95}$; and b) 47.5%
of the scores in the sampling distribution fall between the sample mean $\bar{X} = 7.6$ and 10.63, the
upper limit of $CI_{.95}$. Thus, the area of the curve between the scores 4.57 and 10.63 represents
the middle 95% of the sampling distribution. Two and one-half percent of the scores in the
distribution fall below 4.57 and 2.5% of the scores fall above 10.63. The scores which are below
4.57 or greater than 10.63 comprise the extreme 5% of the sampling distribution.

In Figure 2.7 the following is true: a) 49.5% of the scores in the sampling distribution fall
between the sample mean $\bar{X} = 7.6$ and 3.24, the lower limit of $CI_{.99}$; and b) 49.5% of the scores
in the sampling distribution fall between the sample mean $\bar{X} = 7.6$ and 11.96, the upper limit
of $CI_{.99}$. Thus, the area of the curve between the scores 3.24 and 11.96 represents the middle
99% of the sampling distribution. One-half of one percent of the scores in the distribution fall
below 3.24 and .5% of the scores fall above 11.96. The scores which are below 3.24 or greater
than 11.96 comprise the extreme 1% of the sampling distribution.

The reader should take note of the fact that in Figures 2.6 and 2.7 the sample mean is
employed to represent the mean of the sampling distribution. This is in contrast to the sampling
distribution depicted in Figure 2.1, where the hypothesized value of the population mean is
employed to represent the mean of the sampling distribution. The reason for using different means
for the two sampling distributions is that the sampling distribution depicted in Figure 2.1 is used
to determine the likelihood of a sample mean deviating from the hypothesized value of the
population mean, while the sampling distribution depicted in Figures 2.6 and 2.7 reflects the fact
that one is engaged in interval estimation, and is thus employing the sample mean to predict the
true value of the population mean.

Figure 2.6 Graphical Representation of 95% Confidence Interval for Example 2.1

Figure 2.7 Graphical Representation of 99% Confidence Interval for Example 2.1

Although, as noted previously, $CI_{.95}$ and $CI_{.99}$ are the most commonly computed
confidence intervals, it is possible to calculate a confidence interval at any level of confidence.
Thus, if one wanted to compute the 90% confidence interval, the equation
$CI_{.90} = \bar{X} \pm (t_{.05})(s_{\bar{X}})$ is employed. In the latter equation $t_{.05}$ represents the tabled critical
one-tailed .05 t value (which is also the tabled critical two-tailed .10 t value), since the latter
value (which for df = 9 is t = 1.83) establishes the boundaries for the middle 90% of the
sampling distribution. Only 10% of the scores in the sampling distribution will fall outside that
range (5% of the scores will fall above the upper limit of the range, and 5% of the scores will fall
below the lower limit of the range). Thus, for Example 2.1:

$$CI_{.90} = 7.6 \pm (1.83)(1.34) = 7.6 \pm 2.45 \qquad \text{Thus: } 5.15 \le \mu \le 10.05$$

Employing the same logic, the 98% confidence interval can be computed by using the
tabled critical one-tailed .01 value $t_{.01} = 2.82$ in the confidence interval equation. The latter t
value is employed, since it establishes the boundaries for the middle 98% of the sampling
distribution. Thus:

$$CI_{.98} = 7.6 \pm (2.82)(1.34) = 7.6 \pm 3.78 \qquad \text{Thus: } 3.82 \le \mu \le 11.38$$


It should be noted that in order to accurately compute a confidence interval, one must have
access to tables of the t distribution which provide the appropriate tabled value for the confidence
interval in question. Since most published tables of the t distribution provide only the tabled
critical one- and two-tailed .05 and .01 values, they only allow for accurate computation of the
following confidence intervals: $CI_{.95}$, $CI_{.99}$, $CI_{.90}$, $CI_{.98}$. Table A2 is more detailed than
most tables of the t distribution, and thus it allows for accurate computation of a greater number
of confidence intervals than those noted above. In instances where an exact t value is not tabled,
interpolation can be used to estimate that value.

Although the computation of confidence intervals is not described in the discussion of the
single-sample z test, when the value of the population standard deviation is known, the normal
distribution (as opposed to the t distribution) is employed to compute a confidence interval. If
a researcher knows the value of the population standard deviation, Equation 2.10 is the general
equation for computing a confidence interval.

$$CI_{(1-\alpha)} = \bar{X} \pm (z_{\alpha/2})(\sigma_{\bar{X}}) \qquad \text{(Equation 2.10)}$$

Where: $z_{\alpha/2}$ represents the tabled critical two-tailed value in the normal distribution below
which a proportion (percentage) equal to $[1 - (\alpha/2)]$ of the cases falls. If the
proportion (percentage) of the distribution that falls within the confidence interval is
subtracted from 1 (100%), it will equal the value of $\alpha$.


Note that the basic difference between Equation 2.10 and Equation 2.7 is that Equation
2.10 employs a tabled critical z value instead of the corresponding t value for the same percentile.
Additionally, since use of Equation 2.10 assumes that the value of $\sigma$ is known, the actual value
of the standard error of the population mean can be computed. Thus, $\sigma_{\bar{X}}$ is used in place of the
estimated value $s_{\bar{X}}$ employed in Equation 2.7.

Generalizing from Equation 2.10, Equation 2.11 is employed to compute the 95%
confidence interval for the mean of a population when the normal distribution is used.

$$CI_{.95} = \bar{X} \pm (z_{.05})(\sigma_{\bar{X}}) \qquad \text{(Equation 2.11)}$$


If Equation 2.11 is employed with Example 2.1, the only value that will be different from
those in Equation 2.8 is the tabled critical two-tailed value $z_{.05} = 1.96$, which is used in place
of $t_{.05} = 2.26$. Since we are assuming that $\sigma = \tilde{s}$, the values of $s_{\bar{X}}$ and $\sigma_{\bar{X}}$ are equivalent.
Thus, $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 4.25/\sqrt{10} = 1.34$.

Equation 2.11 will now be utilized to compute the 95% confidence interval for Example 
2.1. 


$$CI_{.95} = 7.6 \pm (1.96)(1.34) = 7.6 \pm 2.63$$

The above result can be summarized as: $4.97 < \mu < 10.23$. Thus, by using the normal
distribution to compute the 95% confidence interval, the physician can be 95% confident (or the
probability is .95) that the average number of visits per patient is between 4.97 and 10.23. Note
that when the normal distribution is employed with the data for Example 2.1, the range of values
that defines the 95% confidence interval is smaller than the range of the values that is computed
with the t distribution. This will always be the case, since, at the same level of confidence, a
tabled z value will always be smaller than the corresponding tabled t value.¹⁰ As a result of this,
the product resulting from multiplying the tabled value by the standard error of the population
mean will be smaller when the normal distribution is employed.

If the value of $\sigma$ is known, Equation 2.12 (as opposed to Equation 2.9) is used to calculate
the 99% confidence interval for the mean of a population.


$$CI_{.99} = \bar{X} \pm (z_{.01})(\sigma_{\bar{X}}) \qquad \text{(Equation 2.12)}$$


Note that in contrast to Equation 2.9, Equation 2.12 employs the tabled critical two-tailed
value $z_{.01} = 2.58$ instead of the corresponding value $t_{.01} = 3.25$. Equation 2.12 will now be
used to compute the 99% confidence interval for Example 2.1.


$$CI_{.99} = 7.6 \pm (2.58)(1.34) = 7.6 \pm 3.46$$


The above result can be summarized as: $4.14 < \mu < 11.06$. Thus, by using the normal
distribution to compute the 99% confidence interval, the physician can be 99% confident (or the
probability is .99) that the average number of visits per patient is between 4.14 and 11.06. Note
once again that the range of values obtained with Equation 2.12 (which utilizes the normal
distribution) for the 99% confidence interval is smaller than the range of the values that is obtained
with Equation 2.9 (which utilizes the t distribution).
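
The two normal-distribution intervals above can be checked with a short Python sketch. It assumes the ten scores of Example 2.1 are the values listed for Example 2.2 (the text states the two data sets are identical) and that $\sigma = 4.25$ is known; the function name `normal_ci` is purely illustrative. The critical z values are obtained from the standard library rather than read from Table A1, so they carry more decimal places than 1.96 and 2.58.

```python
from math import sqrt
from statistics import NormalDist, mean

# Scores for Example 2.1 (the same ten values listed for Example 2.2).
scores = [9, 10, 8, 4, 8, 3, 0, 10, 15, 9]
sigma = 4.25  # population standard deviation, assumed to be known


def normal_ci(data, sigma, confidence):
    """Return (lower, upper) limits for the mean using the normal distribution."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical two-tailed z value
    std_error = sigma / sqrt(len(data))      # standard error of the mean
    margin = z * std_error
    x_bar = mean(data)
    return x_bar - margin, x_bar + margin


ci_95 = normal_ci(scores, sigma, 0.95)
ci_99 = normal_ci(scores, sigma, 0.99)
print("95%% CI: %.2f to %.2f" % ci_95)
print("99%% CI: %.2f to %.2f" % ci_99)
```

Rounded to two decimal places, the limits agree with the hand computations of Equations 2.11 and 2.12.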

Figures 2.8 and 2.9 provide a graphical summary of the computation of the 95% and 99% 
confidence intervals with the normal distribution. 

Figure 2.8 Graphical Representation of 95% Confidence Interval
for Example 2.1 Through Use of the Normal Distribution

Figure 2.9 Graphical Representation of 99% Confidence Interval
for Example 2.1 Through Use of the Normal Distribution

It should be noted that even if the value of $\sigma$ is known, some researchers would challenge
the use of the normal distribution in the above example. The rationale for such a challenge is
that if, as a result of employing a single-sample z test, it is determined that a significant
difference exists between the hypothesized value of the population mean and $\bar{X}$, one can question
the logic of employing the normal distribution in the computation of a confidence interval. This
is the case since, if we conclude that our sample is derived from a population with a different
mean value than the hypothesized population mean, it is also possible that the population
standard deviation is different than the value of $\sigma$ we have employed in the analysis. In such
an instance, the best strategy would probably be to employ the sample data to estimate the
population standard deviation. Thus, Equation 2.1 would be used to compute $\tilde{s}$, and consequently
Equations 2.8 and 2.9 (employing the t distribution) would be used to compute the 95% and 99%
confidence intervals.

It is worth noting that it is unlikely that a researcher will employ the normal distribution to
compute a confidence interval, for the simple reason that it is improbable that one will know the
value of a population standard deviation and not know the value of the population mean. For this
reason most sources only illustrate confidence interval computations in reference to the
t distribution.

In closing the discussion of confidence intervals, the reader should take note of the fact that
the range of values which defines the 95% and 99% confidence intervals for Example 2.1 is
extremely large. Because of this the researcher will not be able to stipulate with great precision a
single value that is likely to represent the true value of the population mean. This illustrates that
a confidence interval may be quite limited in its ability to accurately estimate a population
parameter, especially when the sample size upon which it is based is small. The reader should
also note that the reliability of Equation 2.7 will be compromised if one or more of the
assumptions of the single-sample t test are saliently violated.


VII. Additional Discussion of the Single-Sample t Test 


Degrees of freedom The concept of degrees of freedom, which is frequently encountered in 
statistical analysis, represents the number of values in a set of data that are free to vary after 
certain restrictions have been placed upon the data. The concept of degrees of freedom will be 
illustrated through use of the following example. 

Assume that one is trying to construct a set consisting of three scores which are derived
from a single sample, and it is known that the mean of the sample is $\bar{X} = 5$. Under these
conditions, two of the three scores that comprise the set can assume a variety of values, as long
as the sum of the two scores does not exceed 15 (assuming that no score can be negative). This
is the case since, if $\bar{X} = 5$ and $n = 3$, it is required that $\Sigma X = 15$. Thus, some representative
values that two of the three scores may assume are: 1 and 4; 0 and 6; 8 and 6; and 1 and 1. Note
that in all four cases the sum of the two scores is less than 15. The value of the third score in all
four instances is predetermined based on the values of the other two scores. Thus, if we know that
two of the three scores that comprise a set are 1 and 4, the third score must equal 10, since it is
the only value that will yield $\Sigma X = 15$.
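
The restriction described above can be illustrated with a brief Python sketch. The function name `third_score` is hypothetical; it simply returns the one value that remains once the mean, the sample size, and the other two scores are fixed.

```python
def third_score(first, second, mean=5, n=3):
    """Return the only value that lets n scores average to `mean`."""
    required_sum = mean * n  # with a mean of 5 and n = 3, the sum must be 15
    return required_sum - (first + second)


# The four representative pairs of freely chosen scores from the text.
pairs = [(1, 4), (0, 6), (8, 6), (1, 1)]
thirds = [third_score(a, b) for a, b in pairs]
for pair, third in zip(pairs, thirds):
    print(pair, "-> third score must be", third)
```

Only two of the three values are free to vary, which is the sense in which the set has n - 1 = 2 degrees of freedom.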

The rationale for employing $df = n - 1$ in computing the degrees of freedom for the
single-sample t test can be understood on the basis of the above discussion. Specifically, once the size
of a sample is set and the mean assumes a specific value, only $n - 1$ scores will be free to vary.
In the case of the single-sample t test (as well as a variety of other inferential statistical tests)
degrees of freedom are a function of sample size. Specifically, as the sample size increases, the
degrees of freedom increase. However, this is not always the case. When evaluating categorical
data, degrees of freedom are generally a function of the number of categories involved in the
analysis rather than the size of the sample.

Inspection of Table A2 reveals that as degrees of freedom increase, there is a decrease in
the tabled critical value that the computed absolute value of t must be equal to or greater than
in order to reject the null hypothesis at a given level of significance. Once again, this is not
always the case. For instance, when employing the chi-square distribution (discussed later in the
book) there is a direct relationship between degrees of freedom and the magnitude of the tabled
critical value. In other words, as the number of degrees of freedom increases, the magnitude of
the tabled critical chi-square value at a given level of significance also increases.

As noted in Section I, in actuality a separate t distribution exists for each sample size, and
consequently for each degrees of freedom value. Figure 2.10 depicts the t distribution for three
different degrees of freedom values. Note that the t distribution (which is always symmetrical)
closely resembles the normal distribution. As is noted in Section I, except when $n = \infty$, for a
given standard deviation score, a smaller proportion of cases will fall between the mean of the
t distribution and that standard deviation score than the proportion of cases that fall between the
mean and that same standard deviation score in the normal distribution. As sample size (and thus
degrees of freedom) increases, the shape of the t distribution becomes increasingly similar in
appearance to the normal distribution, and, in fact, when $n = \infty$ becomes identical to it. As a
result of this, when the sample size employed in a study is large (usually n > 200), for all
practical purposes, the tabled critical values for the normal and t distributions are identical.

Inspection of the three t distributions depicted in Figure 2.10 reveals that the lower the
degrees of freedom, the larger the absolute value of t that will be required in order to reject a null
hypothesis at a given level of significance. As an example, the distance between the mean and
the one-tailed .05 critical t value (which corresponds to the 95th percentile of a given curve) is
greatest for the df = 5 distribution (where $t_{.05} = 2.02$) and smallest for the $df = \infty$ distribution
(where $t_{.05} = 1.65$).

Figure 2.10 Representative t Distributions for Different df Values

Most tables of the t distribution list selected degrees of freedom values ranging from 1
through 120, and then list a final row of values for $df = \infty$. The latter set of values are identical
to those in the normal distribution, since, when $df = \infty$, the two distributions are identical. As a
general rule, for sample sizes substantially above 121 (which corresponds to df = 120), the critical
values for $df = \infty$ can be employed. Tables of the t distribution do not include tabled critical
values for all possible degrees of freedom below 120. The protocol that is generally used if the
exact df value is not listed is to either interpolate the critical value or to employ the tabled df
value that is closest to it. Some sources qualify the latter by stating that one should employ the
tabled df value that is closest to but not above the computed df value. The intent of this strategy
is to ensure that the likelihood of committing a Type I error does not exceed the prespecified
alpha value.
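
The two protocols for an untabled df value can be sketched in Python as follows. The dictionary below is an assumption made for illustration: it holds the two-tailed .05 critical t values for df = 30 and df = 40 as they appear in standard t tables (2.042 and 2.021), not entries copied from Table A2, and the function names are hypothetical.

```python
# Two-tailed .05 critical t values from a standard t table (illustrative
# assumption; these rows are not taken from Table A2).
TABLED_T_05 = {30: 2.042, 40: 2.021}


def interpolate_critical(df, table):
    """Linearly interpolate between the two tabled df values that bracket df."""
    lower = max(d for d in table if d <= df)
    upper = min(d for d in table if d >= df)
    if lower == upper:
        return table[lower]
    fraction = (df - lower) / (upper - lower)
    return table[lower] + fraction * (table[upper] - table[lower])


def conservative_critical(df, table):
    """Use the closest tabled df that is not above the computed df."""
    return table[max(d for d in table if d <= df)]


df = 35
t_interp = interpolate_critical(df, TABLED_T_05)
t_conservative = conservative_critical(df, TABLED_T_05)
print("interpolated:", round(t_interp, 4))
print("conservative:", t_conservative)
```

Because critical t values decrease as df increases, the conservative choice always yields a critical value at least as large as the interpolated one, which is why it keeps the Type I error rate from exceeding alpha.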


VIII. Additional Examples Illustrating the Use of the Single-Sample t Test


If, in the case of Examples 1.1-1.4 (all of which are employed to illustrate the single-sample z
test), a researcher does not know the value of $\sigma$ and has to estimate it from the sample data,
the single-sample t test is the appropriate test to use. The 30 scores noted in Section IV of
the single-sample z test in reference to Example 1.1 can be employed to compute the estimated
population standard deviation. Utilizing the 30 scores, the following values are computed:
$\Sigma X = 222$, $\bar{X} = 7.4$, and $\Sigma X^2 = 1866$. Equations 2.1-2.3 can now be employed to conduct the
necessary calculations for the single-sample t test. The null hypothesis that is evaluated is
$H_0: \mu = 8$.





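$$\tilde{s} = \sqrt{\frac{\Sigma X^2 - \frac{(\Sigma X)^2}{n}}{n - 1}} = \sqrt{\frac{1866 - \frac{(222)^2}{30}}{30 - 1}} = 2.77$$
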
Using the t distribution for Examples 1.1-1.4, the degrees of freedom that are employed are
df = 30 - 1 = 29. For df = 29, the tabled critical two-tailed .05 and .01 values are $t_{.05} = 2.05$
and $t_{.01} = 2.76$, and the tabled critical one-tailed .05 and .01 values are $t_{.05} = 1.70$ and
$t_{.01} = 2.46$. Since the computed absolute value t = 1.19 is less than all of the aforementioned
critical values, the null hypothesis cannot be rejected. Note that when the single-sample z test
is used to evaluate the same set of data, the directional alternative hypothesis $H_1: \mu < 8$ is
supported at the .05 level. The discrepancy between the results of the two tests can be attributed to
the fact that the estimated population standard deviation $\tilde{s} = 2.77$ employed for the
single-sample t test is larger than the value $\sigma = 2$ used when the data are evaluated with the
single-sample z test.
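
The computations for this example can be reproduced from the three summary values alone. The sketch below is a minimal illustration that assumes Equations 2.1-2.3 have their usual form (estimated population standard deviation, standard error of the mean, and the t ratio); the variable names are illustrative. Since no intermediate value is rounded, the computed absolute value of t is approximately 1.18 rather than the 1.19 reported above, a difference attributable to rounding in the hand computation.

```python
from math import sqrt

# Summary values for the 30 scores of Example 1.1, as given in the text.
n, sum_x, sum_x2 = 30, 222, 1866
mu_0 = 8  # value stated in the null hypothesis

x_bar = sum_x / n
s_tilde = sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))  # estimated population SD
std_error = s_tilde / sqrt(n)                        # estimated standard error
t = (x_bar - mu_0) / std_error

# Tabled critical values for df = 29 quoted in the text.
critical = {
    "two-tailed .05": 2.05,
    "two-tailed .01": 2.76,
    "one-tailed .05": 1.70,
    "one-tailed .01": 2.46,
}

print("s~ = %.2f, t = %.2f" % (s_tilde, t))
for label, value in critical.items():
    print(label, "reject" if abs(t) >= value else "retain")
```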

The single-sample t test cannot be employed for Examples 1.5 and 1.6 (which are also used
to illustrate the single-sample z test), since, when n = 1, the estimated value of a population
standard deviation will be indeterminate (since at least two scores are required to estimate a
population standard deviation). This is confirmed by the fact that no tabled critical t values are
listed for zero degrees of freedom (which is the case when n = 1).

Example 2.2, which is based on the same data set as Example 2.1, is an additional example 
that can be evaluated with the single-sample t test.


Example 2.2. The Sugar Snack candy company claims that each package of candy it sells contains
bonus coupons (which a consumer can use toward future purchases), and that the average
number of coupons per package is five. Responding to a complaint by a consumer who says the 
company is shortchanging people on coupons, a consumer advocate purchases 10 bags of candy 
from a variety of stores. The advocate counts the number of coupons in each bag and obtains 
the following values: 9, 10, 8, 4, 8, 3, 0, 10, 15, 9. Do the data support the claim of the 
complainant? 


Since the data for Example 2.2 are identical to those employed for Example 2.1, analysis
with the single-sample t test yields the value t = 1.94. The value t = 1.94, which is consistent
with the directional alternative hypothesis $H_1: \mu > 5$, is totally unexpected in view of the
nature of the consumer's complaint. If anything, the researcher evaluating the data is most likely
to employ either the directional alternative hypothesis $H_1: \mu < 5$ or the nondirectional
alternative hypothesis $H_1: \mu \ne 5$ (neither of which is supported at the .05 level). Thus, even
though the directional alternative hypothesis $H_1: \mu > 5$ is supported at the .05 level, the latter
alternative hypothesis would not have been stipulated prior to collecting the data.
Endnotes


1. In order to be solvable, Equation 2.3 requires that there be variability in the sample. If all
of the subjects in a sample have the same score, the computed value of $\tilde{s}$ will equal zero.
When $\tilde{s} = 0$, the value of $s_{\bar{X}}$ will always equal zero. When $s_{\bar{X}} = 0$, Equation 2.3 becomes
unsolvable, thus making it impossible to compute a value for t. It is also the case that when
the sample size is n = 1, Equation 2.1 becomes unsolvable, thus making it impossible to
employ Equation 2.3 to solve for t.


2. In the event that $\sigma$ is known and n < 25, and one elects to employ the single-sample t test,
the value of $\sigma$ should be used in computing the test statistic. Given the fact that the value
of $\sigma$ is known, it would be foolish to employ $\tilde{s}$ as an estimate of it.


3. The t distribution was derived by William Gosset, a British statistician who published under
the pseudonym of Student.


4. It is worth noting that if the value of the population standard deviation in Example 2.1 is
known to be $\sigma = 4.25$, the data can be evaluated with the single-sample z test. When
employed it yields the value z = 1.94, which is identical to the value obtained with Equation
2.3. Specifically, since $\sigma = \tilde{s} = 4.25$, $\sigma_{\bar{X}} = s_{\bar{X}} = 4.25/\sqrt{10} = 1.34$. Employing Equation
1.3 yields z = (7.6 - 5)/1.34 = 1.94. As is the case for the single-sample t test, the latter
value only supports the directional alternative hypothesis $H_1: \mu > 5$ at the .05 level. This
is the case since z = 1.94 is greater than the tabled critical one-tailed value $z_{.05} = 1.65$ in
Table A1. The value z = 1.94, which is less than the tabled critical two-tailed value
$z_{.05} = 1.96$, falls just short of supporting the nondirectional alternative hypothesis
$H_1: \mu \ne 5$ at the .05 level.


5. A sampling distribution of means for the t distribution when employed in the context of the
single-sample t test is interpreted in the same manner as the sampling distribution of means
for the single-sample z test as depicted in Figure 1.1.


6. In the event the researcher is evaluating the power of the test in reference to a value of $\mu_1$
that is less than $\mu = 5$, Distribution B will overlap the left tail of Distribution A.


7. It is really not possible to determine this value with great accuracy by interpolating the
entries in Table A2.


8. If the table for the normal distribution is used to estimate the power of the single-sample t
test in this example, it can be determined that the proportion of cases that falls above a z
value of 1.51 is .0655. Although the value .0655 is close to .085, it slightly underestimates
the power of the test.


9. The value of $\delta$ can be computed directly through use of the following equation:
$\delta = (\mu_1 - \mu)/(\sigma/\sqrt{n})$. Note that the equation expresses effect size in standard deviation
units of the sampling distribution.


10. As noted in Section V, the only exception to this will be when the sample size is extremely
large, in which case the normal and t distributions are identical. Under such conditions the
appropriate values for z and t employed in the confidence interval equation will be identical.
Test 3


The Single-Sample Chi-Square Test for a Population Variance
(Parametric Test Employed with Interval/Ratio Data)


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Does a sample of n subjects (or objects) come from a
population in which the variance ($\sigma^2$) equals a specified value?


Relevant background information on test The single-sample chi-square test for a
population variance is employed in a hypothesis testing situation involving a single sample in order
to determine whether or not a sample with an estimated population variance of $\tilde{s}^2$ is derived from
a population with a variance of $\sigma^2$. If the result of the test is significant, the researcher can
conclude there is a high likelihood the sample is derived from a population in which the variance
is some value other than $\sigma^2$. The single-sample chi-square test for a population variance is
based on the chi-square distribution. The test statistic is represented by the notation $\chi^2$ (where $\chi$
represents the lower case Greek letter chi).

The single-sample chi-square test for a population variance is used with interval/ratio
level data and is based on the following assumptions: a) The distribution of data in the
underlying population from which the sample is derived is normal; and b) The sample has been
randomly selected from the population it represents. If either of the assumptions is saliently
violated, the reliability of the test statistic may be compromised.¹

The chi-square distribution is a continuous asymmetrical theoretical probability distribution.
A chi-square value must fall within the range $0 < \chi^2 < \infty$, and thus (unlike values for the normal
and t distributions) can never be a negative number. As is the case with the t distribution, there
are an infinite number of chi-square distributions, each distribution being a function of the
number of degrees of freedom employed in an analysis. Figure 3.1 depicts the chi-square
distribution for three different degrees of freedom values. Inspection of the three distributions reveals:
a) The lower the degrees of freedom, the more positively skewed the distribution (i.e., the larger
the proportion of scores at the lower end of the distribution); and b) The greater the degrees of
freedom, the more symmetrical the distribution. A thorough discussion of the chi-square
distribution can be found in Section V.

Figure 3.1 Representative Chi-Square Distributions for Different df Values


II. Example 


Example 3.1 The literature published by a company that manufactures hearing aid batteries
claims that a certain model battery has an average life of 7 hours ($\mu = 7$) and a variance of 5
hours ($\sigma^2 = 5$). A customer who uses the hearing aid battery believes that the value stated in
the literature for the variance is too low. In order to test his hypothesis the customer records the
following times (in hours) for ten batteries he purchases during the month of September: 5, 6,
4, 3, 11, 12, 9, 13, 6, 8. Do the data indicate that the variance for battery time is some value
other than 5?


III. Null versus Alternative Hypotheses 


Null hypothesis $H_0: \sigma^2 = 5$


(The variance of the population the sample represents equals 5.) 


Alternative hypothesis $H_1: \sigma^2 \ne 5$


(The variance of the population the sample represents does not equal 5. This is a nondirectional 
alternative hypothesis, and it is evaluated with a two-tailed test. In order to be supported, the 
obtained chi-square value must be equal to or greater than the tabled critical two-tailed chi-square 
value at the prespecified level of significance in the upper tail of the chi-square distribution, or 
equal to or less than the tabled critical two-tailed chi-square value at the prespecified level of 
significance in the lower tail of the chi-square distribution. A full explanation of the protocol for 
interpreting chi-square values within the framework of the single-sample chi-square test for a 
population variance can be found in Section V.) 


or
$H_1: \sigma^2 > 5$


(The variance of the population the sample represents is greater than 5. This is a directional
alternative hypothesis, and it is evaluated with a one-tailed test. In order to be supported, the
obtained chi-square value must be equal to or greater than the tabled critical one-tailed chi-square
value at the prespecified level of significance in the upper tail of the chi-square distribution.)


or
$H_1: \sigma^2 < 5$


(The variance of the population the sample represents is less than 5. This is a directional 
alternative hypothesis, and it is evaluated with a one-tailed test. In order to be supported, the 
obtained chi-square value must be equal to or less than the tabled critical one-tailed chi-square 
value at the prespecified level of significance in the lower tail of the chi-square distribution.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative
hypothesis the researcher selects is supported, the null hypothesis is rejected.²
IV. Test Computations

Table 3.1 summarizes the data for Example 3.1.


Table 3.1 Summary of Data for Example 3.1

Battery      $X$      $X^2$
   1          5        25
   2          6        36
   3          4        16
   4          3         9
   5         11       121
   6         12       144
   7          9        81
   8         13       169
   9          6        36
  10          8        64
         $\Sigma X = 77$   $\Sigma X^2 = 701$

In order to compute the test statistic for the single-sample chi-square test for a population
variance, it is necessary to use the sample data to calculate an unbiased estimate of the
population variance. $\tilde{s}^2$, the unbiased estimate of $\sigma^2$, is computed with Equation 3.1 (which is
the same as Equation I.5 in the Introduction).


$$\tilde{s}^2 = \frac{\Sigma X^2 - \frac{(\Sigma X)^2}{n}}{n - 1} \qquad \text{(Equation 3.1)}$$


Employing Equation 3.1, the value $\tilde{s}^2 = 12.01$ is computed.


$$\tilde{s}^2 = \frac{701 - \frac{(77)^2}{10}}{10 - 1} = 12.01$$


Equation 3.2 is the test statistic for the single-sample chi-square test for a population
variance.


$$\chi^2 = \frac{(n - 1)\tilde{s}^2}{\sigma^2} \qquad \text{(Equation 3.2)}$$


Employing the values n = 10, $\tilde{s}^2 = 12.01$, and $\sigma^2 = 5$ (which is the hypothesized value
of $\sigma^2$ stated in the null hypothesis), Equation 3.2 is employed to compute the value $\chi^2 = 21.62$.


$$\chi^2 = \frac{(10 - 1)(12.01)}{5} = 21.62$$
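
The arithmetic of Equations 3.1 and 3.2 can be verified with the following Python sketch, which works directly from the ten battery times of Example 3.1. The variable names are illustrative, and the cross-check against `statistics.variance` is included only to confirm that Equation 3.1 is the usual n - 1 estimate of the variance.

```python
from math import isclose
from statistics import variance

# Battery times (in hours) from Example 3.1.
times = [5, 6, 4, 3, 11, 12, 9, 13, 6, 8]
sigma2_0 = 5  # variance stated in the null hypothesis

n = len(times)
sum_x = sum(times)
sum_x2 = sum(x * x for x in times)

# Equation 3.1: unbiased estimate of the population variance.
s2_tilde = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Equation 3.2: chi-square test statistic, evaluated with df = n - 1.
chi2 = (n - 1) * s2_tilde / sigma2_0

print("sum X = %d, sum X^2 = %d" % (sum_x, sum_x2))
print("s~^2 = %.2f, chi-square = %.2f, df = %d" % (s2_tilde, chi2, n - 1))
```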


V. Interpretation of the Test Results

The computed value $\chi^2 = 21.62$ is evaluated with Table A4 (Table of the Chi-Square
Distribution) in the Appendix. In Table A4, chi-square values are listed in relation to the
proportion of cases (which are recorded at the top of each column) that falls below a tabled $\chi^2$
value in the sampling distribution, and the number of degrees of freedom (which are recorded
in the left-hand column of each row).³ Equation 3.3 is employed to compute the degrees of
freedom for the single-sample chi-square test for a population variance.


$$df = n - 1 \qquad \text{(Equation 3.3)}$$


Employing Equation 3.3, we compute that df = 10 - 1 = 9. For df = 9, the tabled critical
two-tailed .05 values are $\chi^2_{.025} = 2.70$ and $\chi^2_{.975} = 19.02$. These values are the tabled critical
two-tailed .05 values, since the proportion of the distribution that falls between $\chi^2 = 0$ and
$\chi^2_{.025} = 2.70$ is .025, and the proportion of the distribution that falls above $\chi^2_{.975} = 19.02$ is .025.
Thus, the extreme 5% of the distribution is comprised of chi-square values that fall below
$\chi^2_{.025} = 2.70$ and above $\chi^2_{.975} = 19.02$. In the same respect, the tabled critical two-tailed .01
values are $\chi^2_{.005} = 1.73$ and $\chi^2_{.995} = 23.59$. These values are the tabled critical two-tailed .01
values, since the proportion of the distribution that falls between $\chi^2 = 0$ and $\chi^2_{.005} = 1.73$ is
.005, and the proportion of the distribution that falls above $\chi^2_{.995} = 23.59$ is .005. Thus, the
extreme 1% of the distribution is comprised of chi-square values that fall below $\chi^2_{.005} = 1.73$
and above $\chi^2_{.995} = 23.59$.

For df = 9, the tabled critical one-tailed .05 values are $\chi^2_{.05} = 3.33$ and $\chi^2_{.95} = 16.92$.
These values are the tabled critical one-tailed .05 values, since the proportion of the distribution
that falls between $\chi^2 = 0$ and $\chi^2_{.05} = 3.33$ is .05, and the proportion of the distribution that falls
above $\chi^2_{.95} = 16.92$ is .05. In the same respect, the tabled critical one-tailed .01 values are
$\chi^2_{.01} = 2.09$ and $\chi^2_{.99} = 21.67$. These values are the tabled critical one-tailed .01 values, since
the proportion of the distribution that falls between $\chi^2 = 0$ and $\chi^2_{.01} = 2.09$ is .01, and the
proportion of the distribution that falls above $\chi^2_{.99} = 21.67$ is .01.

Figures 3.2 and 3.3 depict the tabled critical .05 and .01 values for the chi-square sampling
distribution when df = 9. The mean of a chi-square sampling distribution will always equal the
degrees of freedom for the distribution. Thus, $\mu_{\chi^2} = df = n - 1$. The standard deviation of
the sampling distribution will always be $\sigma_{\chi^2} = \sqrt{2df}$. Consequently, the variance will be
$\sigma^2_{\chi^2} = 2df$. In the case of Example 3.1 where df = 9, $\mu_{\chi^2} = 9$ and $\sigma^2_{\chi^2} = 18$.

The following guidelines are employed in evaluating the null hypothesis for the
single-sample chi-square test for a population variance.

a) If the alternative hypothesis employed is nondirectional, the null hypothesis can be 
rejected if the obtained chi-square value is equal to or greater than the tabled critical two-tailed 
chi-square value at the prespecified level of significance in the upper tail of the chi-square 
distribution, or equal to or less than the tabled critical two-tailed chi-square value at the
prespecified level of significance in the lower tail of the chi-square distribution.

b) If the alternative hypothesis employed is directional and predicts a population variance 
larger than the value stated in the null hypothesis, the null hypothesis can only be rejected if the 
obtained chi-square value is equal to or greater than the tabled critical one-tailed chi-square value 
at the prespecified level of significance in the upper tail of the chi-square distribution. 

c) If the alternative hypothesis employed is directional and predicts a population variance 
smaller than the value stated in the null hypothesis, the null hypothesis can only be rejected if the 
obtained chi-square value is equal to or less than the tabled critical one-tailed chi-square value 
at the prespecified level of significance in the lower tail of the chi-square distribution. 

Employing the above guidelines, we can conclude the following. 

The nondirectional alternative hypothesis $H_1: \sigma^2 \ne 5$ is supported at the .05 level. This
is the case since the obtained value $\chi^2 = 21.62$ is greater than the tabled critical two-tailed
.05 value $\chi^2_{.975} = 19.02$ in the upper tail of the distribution. The nondirectional alternative
hypothesis $H_1: \sigma^2 \ne 5$ is not supported at the .01 level, since $\chi^2 = 21.62$ is less than the tabled
critical two-tailed .01 value $\chi^2_{.995} = 23.59$ in the upper tail of the distribution.

Figure 3.3 Tabled Critical Two-Tailed and One-Tailed .01 $\chi^2$ Values for df = 9

The directional alternative hypothesis $H_1: \sigma^2 > 5$ is supported at the .05 level. This is
the case since the obtained value $\chi^2 = 21.62$ is greater than the tabled critical one-tailed .05
value $\chi^2_{.95} = 16.92$ in the upper tail of the distribution. The directional alternative hypothesis
$H_1: \sigma^2 > 5$ is not supported at the .01 level, since $\chi^2 = 21.62$ is less than the tabled critical
one-tailed .01 value $\chi^2_{.99} = 21.67$ in the upper tail of the distribution. Note that in order for the
data to be consistent with the directional alternative hypothesis $H_1: \sigma^2 > 5$, the computed value of
$\tilde{s}^2$ must be greater than the value $\sigma^2 = 5$ stated in the null hypothesis.

The directional alternative hypothesis $H_1: \sigma^2 < 5$ is not supported. This is the case since
in order for the latter alternative hypothesis to be supported, the obtained chi-square value must
be equal to or less than the tabled critical one-tailed .05 value $\chi^2_{.05} = 3.33$ in the lower tail of
the distribution. If $\alpha = .01$ is employed, in order for the directional alternative hypothesis
$H_1: \sigma^2 < 5$ to be supported, the obtained chi-square value must be equal to or less than the
tabled critical one-tailed value $\chi^2_{.01} = 2.09$ in the lower tail of the distribution. In order for the
data to be consistent with the alternative hypothesis $H_1: \sigma^2 < 5$, the computed value of $\tilde{s}^2$ must
be less than the value $\sigma^2 = 5$ stated in the null hypothesis.

A summary of the analysis of Example 3.1 with the single-sample chi-square test for a
population variance follows: We can conclude that it is unlikely that the sample of 10 batteries
comes from a population with a variance equal to 5. The data suggest that the variance of the
population is greater than 5. This result can be summarized as follows: $\chi^2(9) = 21.62$, $p < .05$.
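
The decision guidelines a) through c) can be expressed as a small Python function. This is a sketch only: the critical values are the df = 9 entries quoted in the text (lower tail, upper tail), and the labels "not_equal", "greater", and "less" are hypothetical names for the three alternative hypotheses.

```python
# Tabled critical chi-square values for df = 9 quoted in the text,
# stored as (lower-tail value, upper-tail value).
CRITICAL_DF9 = {
    ("two-tailed", 0.05): (2.70, 19.02),
    ("two-tailed", 0.01): (1.73, 23.59),
    ("one-tailed", 0.05): (3.33, 16.92),
    ("one-tailed", 0.01): (2.09, 21.67),
}


def is_supported(chi2, alternative, alpha):
    """Return True if the alternative hypothesis is supported at level alpha."""
    if alternative == "not_equal":   # guideline a), nondirectional
        lower, upper = CRITICAL_DF9[("two-tailed", alpha)]
        return chi2 >= upper or chi2 <= lower
    lower, upper = CRITICAL_DF9[("one-tailed", alpha)]
    if alternative == "greater":     # guideline b), variance larger than stated
        return chi2 >= upper
    if alternative == "less":        # guideline c), variance smaller than stated
        return chi2 <= lower
    raise ValueError("unknown alternative: " + alternative)


chi2 = 21.62  # obtained value for Example 3.1
results = {
    (alt, alpha): is_supported(chi2, alt, alpha)
    for alt in ("not_equal", "greater", "less")
    for alpha in (0.05, 0.01)
}
for key, supported in results.items():
    print(key, "supported" if supported else "not supported")
```

The output matches the conclusions stated for Example 3.1: the nondirectional hypothesis and the "greater than" directional hypothesis are supported at the .05 level only, and the "less than" hypothesis is not supported at either level.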

Although it is not required in order to evaluate the null hypothesis $H_0: \sigma^2 = 5$, through use
of Equations I.1/1.1 it can be determined that the mean number of hours the 10 batteries
functioned is $\bar{X} = \Sigma X/n = 77/10 = 7.7$. The latter value is greater than $\mu = 7$, which is the
mean number of hours claimed in the company's literature. If the researcher wants to determine
whether the value $\bar{X} = 7.7$ is significantly larger than the value $\mu = 7$, it can be argued that one
should employ the single-sample t test (Test 2) for the analysis. The rationale for employing
the latter test is that if the null hypothesis $H_0: \sigma^2 = 5$ is rejected, the researcher is concluding
that the true value of the population variance is unknown, and consequently, the latter value
should be estimated from the sample data. If, on the other hand, one does not employ the
single-sample chi-square test for a population variance to evaluate the null hypothesis $H_0: \sigma^2 = 5$
(and is thus unaware of the fact that the data are not consistent with the latter null hypothesis),
one will assume $\sigma^2 = 5$ and employ the single-sample z test (Test 1) to evaluate the null
hypothesis $H_0: \mu = 7$ (since the latter test is employed when the value of $\sigma^2$ is known).


VI. Additional Analytical Procedures for the Single-Sample 
Chi-Square Test for a Population Variance and/or Related Tests 


1. Large sample normal approximation of the chi-square distribution When the sample 
size is larger than 30, the normal distribution can be employed to approximate the test statistic 
for the single-sample chi-square test for a population variance. Equation 3.4 is employed to 
compute the normal approximation. 


$$z = \frac{\tilde{s} - \sigma}{\frac{\sigma}{\sqrt{2n}}} \qquad \text{(Equation 3.4)}$$





To illustrate the use of Equation 3.4, let us assume that in Example 3.1 the computed value
$\tilde{s}^2 = 12.01$ is based on a sample size of n = 30. Employing Equation 3.2, the value $\chi^2 = 69.66$
is computed.


$$\chi^2 = \frac{(30 - 1)(12.01)}{5} = 69.66$$


Employing Table A4, for df = 30 - 1 = 29, the tabled critical two-tailed .05 and .01
values in the upper tail of the chi-square distribution are $\chi^2_{.975} = 45.72$ and $\chi^2_{.995} = 52.34$, and
the tabled critical one-tailed .05 and .01 values in the upper tail of the distribution are
$\chi^2_{.95} = 42.56$ and $\chi^2_{.99} = 49.59$. Since the obtained value $\chi^2 = 69.66$ is greater than all of the
aforementioned critical values, the nondirectional alternative hypothesis $H_1: \sigma^2 \ne 5$ and the
directional alternative hypothesis $H_1: \sigma^2 > 5$ are supported at both the .05 and .01 levels.

Equation 3.4 will now be employed to evaluate the data for Example 3.1, if n = 30. Since
$\sigma^2 = 5$ and $\tilde{s}^2 = 12.01$, it follows that $\sigma = 2.24$ and $\tilde{s} = 3.47$. These values are substituted
in Equation 3.4.


$$z = \frac{3.47 - 2.24}{\frac{2.24}{\sqrt{(2)(30)}}} = 4.24$$


Employing Table A1 (Table of the Normal Distribution) in the Appendix, we determine
that the tabled critical two-tailed .05 and .01 values are z.05 = 1.96 and z.01 = 2.58, and the
tabled critical one-tailed .05 and .01 values are z.05 = 1.65 and z.01 = 2.33. Since the obtained
value z = 4.24 is greater than all of the aforementioned critical values, the nondirectional
alternative hypothesis H₁: σ² ≠ 5 and the directional alternative hypothesis H₁: σ² > 5 are
supported at both the .05 and .01 levels. The conclusions derived through use of the normal
approximation are identical to those reached with Equation 3.2.

If Equation 3.4 is employed with Example 3.1 for n = 10, it results in the value z = 2.46.
Specifically: z = [3.47 - 2.24]/[2.24/√((2)(10))] = 2.46. Since the latter value is greater
than the tabled critical two-tailed value z.05 = 1.96 and the tabled critical one-tailed values
z.05 = 1.65 and z.01 = 2.33, the nondirectional alternative hypothesis H₁: σ² ≠ 5 is supported
at the .05 level and the directional alternative hypothesis H₁: σ² > 5 is supported at both the
.05 and .01 levels. Recollect that when Equation 3.2 is employed with Example 3.1, the
nondirectional alternative hypothesis H₁: σ² ≠ 5 and the directional alternative hypothesis
H₁: σ² > 5 are both supported, but only at the .05 level. Thus, when the normal approximation
is employed with a small sample size, it appears to inflate the likelihood of committing a Type
I error (since the normal approximation supports the directional alternative hypothesis
H₁: σ² > 5 at the .01 level).
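
The arithmetic behind Equations 3.2 and 3.4 can be sketched in a few lines of Python. This is
only an illustration and is not part of the original text: the function names
chi_square_variance_stat and normal_approx_z are hypothetical, and because the sketch uses
unrounded square roots, its z values can differ from the hand computations in the second
decimal place (for n = 30 it gives roughly 4.26 rather than 4.24).

```python
import math


def chi_square_variance_stat(n, s2_est, sigma2_null):
    """Equation 3.2: chi-square = (n - 1) * s~^2 / sigma^2."""
    return (n - 1) * s2_est / sigma2_null


def normal_approx_z(n, s2_est, sigma2_null):
    """Equation 3.4: z = (s~ - sigma) / (sigma / sqrt(2n))."""
    s_est = math.sqrt(s2_est)        # estimated population standard deviation
    sigma = math.sqrt(sigma2_null)   # standard deviation stated in the null hypothesis
    return (s_est - sigma) / (sigma / math.sqrt(2 * n))


# Example 3.1 values (s~^2 = 12.01, sigma^2 = 5) for the two sample sizes discussed.
for n in (30, 10):
    chi2 = chi_square_variance_stat(n, 12.01, 5)
    z = normal_approx_z(n, 12.01, 5)
    print(f"n = {n}: chi-square = {chi2:.2f}, z = {z:.2f}")
```

Each statistic is then compared with the tabled critical values exactly as described above.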


2. Computation of a confidence interval for the variance of a population represented by
a sample An equation for computing a confidence interval for the variance of a population (as
well as the population standard deviation) can be derived by algebraically transposing the terms
in Equation 3.2. As is noted in the discussion of the single-sample t test, a confidence interval
is a range of values within which a researcher can be confident to a specified degree that the true
value of a population parameter falls. Stated probabilistically, a confidence interval identifies
the range of values within which there is a specific likelihood that the population parameter
falls. Equation 3.5 is the general equation for computing a confidence interval for a population
variance.

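\[ \frac{(n - 1)\tilde{s}^2}{\chi^2_{[1 - (\alpha/2)]}} \le \sigma^2 \le \frac{(n - 1)\tilde{s}^2}{\chi^2_{(\alpha/2)}} \]   (Equation 3.5)


Where: χ²[1 - (α/2)] is the tabled critical two-tailed value in the chi-square distribution below
       which a proportion (percentage) equal to [1 - (α/2)] of the cases falls, and χ²(α/2) is the
       tabled critical two-tailed value below which a proportion (percentage) equal to α/2 of
       the cases falls. If the proportion (percentage) of the distribution that falls within the
       confidence interval is subtracted from 1 (100%), it will equal the value of α.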
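
The algebraic transposition of Equation 3.2 can be written out as follows. This is a sketch of
the standard argument rather than a derivation taken from the text, and it assumes (as the test
itself does) that the sample is drawn from a normally distributed population, so that the
quantity (n - 1)s̃²/σ² follows the chi-square distribution with df = n - 1.

```latex
\begin{align*}
P\!\left( \chi^2_{(\alpha/2)} \le \frac{(n-1)\tilde{s}^2}{\sigma^2} \le \chi^2_{[1-(\alpha/2)]} \right) &= 1 - \alpha \\
P\!\left( \frac{1}{\chi^2_{[1-(\alpha/2)]}} \le \frac{\sigma^2}{(n-1)\tilde{s}^2} \le \frac{1}{\chi^2_{(\alpha/2)}} \right) &= 1 - \alpha \\
P\!\left( \frac{(n-1)\tilde{s}^2}{\chi^2_{[1-(\alpha/2)]}} \le \sigma^2 \le \frac{(n-1)\tilde{s}^2}{\chi^2_{(\alpha/2)}} \right) &= 1 - \alpha
\end{align*}
```

Taking reciprocals in the second step reverses the inequalities, which is why the larger critical
value appears in the lower limit of the interval and the smaller critical value in the upper limit.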


Equations 3.6 and 3.7 are employed to compute the 95% and 99% confidence intervals for
a population variance. The critical values employed in Equation 3.6 demarcate the middle 95%
of the chi-square distribution, while the critical values employed in Equation 3.7 demarcate the
middle 99% of the distribution.


\[ \frac{(n - 1)\tilde{s}^2}{\chi^2_{.975}} \le \sigma^2 \le \frac{(n - 1)\tilde{s}^2}{\chi^2_{.025}} \]   (Equation 3.6)


\[ \frac{(n - 1)\tilde{s}^2}{\chi^2_{.995}} \le \sigma^2 \le \frac{(n - 1)\tilde{s}^2}{\chi^2_{.005}} \]   (Equation 3.7)


Using the data for Example 3.1, Equation 3.6 is employed to compute the 95% confidence 
interval for the population variance. 


\[ \frac{(10 - 1)(12.01)}{19.02} \le \sigma^2 \le \frac{(10 - 1)(12.01)}{2.70} \]


\[ 5.68 \le \sigma^2 \le 40.03 \]


Thus, we can be 95% confident (or the probability is .95) that the true value of the variance 
of the population the sample represents falls between the values 5.68 and 40.03. By taking the 
square root of the latter values, we can determine the 95% confidence interval for the population 
standard deviation. Thus, 2.38 ≤ σ ≤ 6.33. In other words, we can be 95% confident (or the
probability is .95) that the true value of the standard deviation of the population the sample 
represents falls between the values 2.38 and 6.33. 

Equation 3.7 is employed below to compute the 99% confidence interval. 


\[ \frac{(10 - 1)(12.01)}{23.59} \le \sigma^2 \le \frac{(10 - 1)(12.01)}{1.73} \]


\[ 4.58 \le \sigma^2 \le 62.48 \]


Thus, we can be 99% confident (or the probability is .99) that the true value of the variance 
of the population the sample represents falls between the values 4.58 and 62.48. By taking the 
square root of the latter values, we can determine the 99% confidence interval for the population 
standard deviation. Thus, 2.14 ≤ σ ≤ 7.90. In other words, we can be 99% confident (or the
probability is .99) that the true value of the standard deviation of the population the sample 
represents falls between the values 2.14 and 7.90. Note that (as is the case for confidence 
intervals for a population mean) a larger range of values defines the 99% confidence interval for 
a population variance than the 95% confidence interval. 
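
The interval computations for Equations 3.6 and 3.7 can be reproduced with the short Python
sketch below. It is an illustration only: the function name variance_ci and its parameter names
are hypothetical, and the critical values are typed in from the chi-square table exactly as they
are quoted above rather than computed.

```python
import math


def variance_ci(n, s2_est, chi2_upper, chi2_lower):
    """Equation 3.5: divide (n - 1) * s~^2 by each tabled critical value.

    chi2_upper is the tabled value with 1 - alpha/2 of the cases below it;
    chi2_lower is the tabled value with alpha/2 of the cases below it.
    """
    numerator = (n - 1) * s2_est
    return numerator / chi2_upper, numerator / chi2_lower


# Tabled critical values for df = 9, as quoted in the text for Example 3.1.
ci_95 = variance_ci(10, 12.01, chi2_upper=19.02, chi2_lower=2.70)
ci_99 = variance_ci(10, 12.01, chi2_upper=23.59, chi2_lower=1.73)

for label, (low, high) in (("95%", ci_95), ("99%", ci_99)):
    print(f"{label} CI for the variance: {low:.2f} to {high:.2f}")
    print(f"{label} CI for the standard deviation: "
          f"{math.sqrt(low):.2f} to {math.sqrt(high):.2f}")
```

Passing the critical values as arguments mirrors the table-lookup procedure used throughout
this section.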

When n > 30, the normal distribution can be employed to approximate the confidence 
interval for a population standard deviation. Equation 3.8 is the general equation for computing 
a confidence interval using the normal approximation. 

\[ \frac{\tilde{s}}{1 + \dfrac{z_{(\alpha/2)}}{\sqrt{2n}}} \le \sigma \le \frac{\tilde{s}}{1 - \dfrac{z_{(\alpha/2)}}{\sqrt{2n}}} \]   (Equation 3.8)


Where: z(α/2) is the tabled critical two-tailed value in the normal distribution below which a
       proportion (percentage) equal to [1 - (α/2)] of the cases falls. If the proportion
       (percentage) of the distribution that falls within the confidence interval is subtracted
       from 1 (100%), it will equal the value of α.

Equations 3.9 and 3.10 employ the normal approximation to compute the 95% and 99%
confidence intervals for a population standard deviation. The values z.05 and z.01 used in the
latter equations are the tabled critical two-tailed .05 and .01 values z.05 = 1.96 and z.01 = 2.58.
By squaring a value obtained for the confidence interval for a population standard deviation, one
can determine the confidence interval for the population variance.


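\[ \frac{\tilde{s}}{1 + \dfrac{1.96}{\sqrt{2n}}} \le \sigma \le \frac{\tilde{s}}{1 - \dfrac{1.96}{\sqrt{2n}}} \]   (Equation 3.9)


\[ \frac{\tilde{s}}{1 + \dfrac{2.58}{\sqrt{2n}}} \le \sigma \le \frac{\tilde{s}}{1 - \dfrac{2.58}{\sqrt{2n}}} \]   (Equation 3.10)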


For purposes of illustration, let us assume that n = 30 in Example 3.1. Using n = 30 and
s̃² = 12.01 for Example 3.1, Equation 3.9 is employed to compute the 95% confidence interval
for the population standard deviation and variance.


\[ \frac{3.47}{1 + \dfrac{1.96}{\sqrt{(2)(30)}}} \le \sigma \le \frac{3.47}{1 - \dfrac{1.96}{\sqrt{(2)(30)}}} \]


\[ 2.77 \le \sigma \le 4.65 \]

\[ 7.67 \le \sigma^2 \le 21.62 \]


Using n = 30 and s̃² = 12.01 for Example 3.1, Equation 3.10 is employed to compute
the 99% confidence interval for the population standard deviation and variance.


\[ \frac{3.47}{1 + \dfrac{2.58}{\sqrt{(2)(30)}}} \le \sigma \le \frac{3.47}{1 - \dfrac{2.58}{\sqrt{(2)(30)}}} \]


\[ 2.61 \le \sigma \le 5.18 \]

\[ 6.81 \le \sigma^2 \le 26.83 \]


If Equations 3.6 and 3.7 are employed with Example 3.1 for n = 30 and s̃² = 12.01, they
yield values close to those obtained with Equations 3.9 and 3.10. As the size of the sample
increases, the values that define a confidence interval based on the use of the normal versus
chi-square distributions converge, and for large sample sizes the two distributions yield
essentially the same values.


Equation 3.6 is employed to compute the 95% confidence interval. Note that for df = 29,
the tabled critical values χ².025 = 16.05 and χ².975 = 45.72 are employed in Equation 3.6.


\[ \frac{(30 - 1)(12.01)}{45.72} \le \sigma^2 \le \frac{(30 - 1)(12.01)}{16.05} \]


\[ 7.62 \le \sigma^2 \le 21.70 \]

\[ 2.76 \le \sigma \le 4.66 \]

Equation 3.7 is employed to compute the 99% confidence interval. Note that for df = 29,
the tabled critical values χ².005 = 13.21 and χ².995 = 52.34 are employed in Equation 3.7.


\[ \frac{(30 - 1)(12.01)}{52.34} \le \sigma^2 \le \frac{(30 - 1)(12.01)}{13.21} \]


\[ 6.65 \le \sigma^2 \le 26.37 \]

\[ 2.58 \le \sigma \le 5.13 \]
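
The normal approximation in Equation 3.8 can be sketched in Python as shown below. The
function name sd_ci_normal_approx is hypothetical, and the sketch makes two assumptions
that depart from the hand computations: it obtains the critical z value from the standard
library's NormalDist rather than from Table A1, and it does not round intermediate values.
Its limits can therefore differ from those above by about .01.

```python
import math
from statistics import NormalDist


def sd_ci_normal_approx(n, s2_est, confidence):
    """Equation 3.8: s~ / (1 +/- z / sqrt(2n)), z being the two-tailed critical value."""
    s_est = math.sqrt(s2_est)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ratio = z / math.sqrt(2 * n)
    if ratio >= 1:
        # The upper limit divides by (1 - ratio), so it is undefined for very small n.
        raise ValueError("n is too small for the normal approximation")
    return s_est / (1 + ratio), s_est / (1 - ratio)


# Example 3.1 with n = 30, as assumed in the text.
for confidence in (0.95, 0.99):
    low, high = sd_ci_normal_approx(30, 12.01, confidence)
    print(f"{confidence:.0%} CI for sigma: {low:.2f} to {high:.2f}")
    print(f"{confidence:.0%} CI for sigma squared: {low ** 2:.2f} to {high ** 2:.2f}")
```

Squaring the two limits converts the interval for the standard deviation into an interval for
the variance, as noted in the discussion of Equations 3.9 and 3.10.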


In closing the discussion of confidence intervals, the reader should take note of the fact that
the range of values which defines the 95% and 99% confidence intervals for Example 3.1 is
extremely large. Because of this, the researcher will not be able to stipulate with great precision
a single value that is likely to represent the true value of the population variance/standard
deviation. This illustrates that a confidence interval may be quite limited in its ability to
accurately estimate a population parameter, especially when the sample size upon which it is
based is small (see Endnote 5). The reader should also take note of the fact that the reliability
of Equations 3.5 and 3.8 will be compromised if one or more of the assumptions of the
single-sample chi-square test for a population variance is saliently violated.


3. Sources for computing the power of the single-sample chi-square test for a population 
variance Although it will not be described in this book, the protocol for computing the power 
of the single-sample chi-square test for a population variance is described in Guenther (1965) 
(who provides computational guidelines and graphs for quick power computations) and Zar 
(1999). 


VII. Additional Discussion of the Single-Sample Chi-Square Test 
for a Population Variance 


No additional material will be discussed in this section. 


VIII. Additional Examples Illustrating the Use of the Single- 
Sample Chi-Square Test for a Population Variance 


With the exception of Examples 1.5 and 1.6, the single-sample chi-square test for a population
variance can be employed to test a hypothesis about a population variance (or standard
deviation) with any of the examples that are employed to illustrate the single-sample z test
(Examples 1.1-1.4) and the single-sample t test (Examples 2.1 and 2.2).

Examples 3.2-3.5 are four additional examples that can be evaluated with the single-sample
chi-square test for a population variance. Since these examples employ the same
population parameters and sample data used in Example 3.1, they yield the same result. The
reader should take note of the fact that although in Examples 3.4 and 3.5 the value σ = 2.24 is
given for the population standard deviation, when the latter value is squared, it results in σ² = 5.
A different set of data are employed in Example 3.6, which is the last example presented in this
section.


Example 3.2 A manufacturer of a machine that makes ball bearings claims that the variance
of the diameter of the ball bearings produced by the machine is 5 millimeters. A company that
has purchased the machine measures the diameter of a random sample of ten ball bearings. The
computed estimated population variance of the diameters of ten ball bearings is 12.01
millimeters. Is the obtained value s̃² = 12.01 consistent with the null hypothesis H₀: σ² = 5?


Example 3.3 A meteorologist develops a theory which predicts a variance of five degrees for
the temperature recorded at noon each day in a canyon situated within a mountain range 100
miles east of the South Pole. Over a ten-day period the following temperatures are recorded:
5, 6, 4, 3, 11, 12, 9, 13, 6, 8. Do the data support the meteorologist's theory?


Example 3.4 A chemical company claims that the standard deviation for the number of tons 
of waste it discards annually is σ = 2.24. Assume that during a ten-year period the following
annual values are recorded for the number of tons of waste discarded: 5, 6, 4, 3, 11, 12, 9, 13, 
6, 8. Are the data consistent with the company's claim? 


Example 3.5 During more than three decades of teaching, a college professor determines that
the mean and standard deviation on a chemistry final examination are respectively 7 and 2.24.
During the fall semester of a year in which he employs a new teaching method, ten students who
take the final examination obtain the following scores: 5, 6, 4, 3, 11, 12, 9, 13, 6, 8. Do the data
suggest that the new teaching method results in an increase in variability in performance?


Example 3.6 A pharmaceutical company claims that upon ingesting its cough medicine a 
person ceases to cough almost immediately. The company claims that the standard deviation of 
the number of coughs emitted by a person after ingesting the medicine is 5. In order to evaluate 
the company's claim, a physician evaluates a random sample of ten patients who come into his 
office coughing excessively. The physician records the following number of coughs emitted by 
each of the patients after he is given a therapeutic dosage of the medicine: 9, 10, 8, 4, 8, 3, 0, 
10, 15, 9. Is the variability of the data consistent with the company's claim for the product? 


The data for Example 3.6 are identical to those employed for Example 2.1, which is used
to illustrate the single-sample t test. In the computations for the latter test, through use of
Equation 2.1, it is determined that the estimated population standard deviation for the data is
s̃ = 4.25. Since the population variance is the square of the standard deviation, we can
determine that σ² = (5)² = 25 and s̃² = (4.25)² = 18.06. Since the company claims that the
population standard deviation is 5, the null hypothesis for Example 3.6 can be stated as either
H₀: σ = 5 or H₀: σ² = 25. The nondirectional and directional alternative hypotheses (stated
in reference to H₀: σ = 5) that can be employed for Example 3.6 are as follows: H₁: σ ≠ 5;
H₁: σ > 5; and H₁: σ < 5. When Equation 3.2 is employed to evaluate the null hypothesis
H₀: σ = 5, the value χ² = 6.50 is computed.

\[ \chi^2 = \frac{(10 - 1)(18.06)}{25} = 6.50 \]


Since n = 10, the tabled critical values for df = 9 are employed to evaluate the computed
value χ² = 6.50. Since the latter value is less than the tabled critical two-tailed .05 value
χ².975 = 19.02 in the upper tail of the chi-square distribution and greater than the tabled critical
two-tailed .05 value χ².025 = 2.70 in the lower tail of the distribution, the nondirectional
alternative hypothesis H₁: σ ≠ 5 is not supported. Since χ² = 6.50 is less than the tabled
critical one-tailed .05 value χ².95 = 16.92 in the upper tail of the distribution, the directional
alternative hypothesis H₁: σ > 5 is not supported. Since χ² = 6.50 is greater than the tabled
critical one-tailed .05 value χ².05 = 3.33 in the lower tail of the distribution, the directional
alternative hypothesis H₁: σ < 5 is not supported. Thus, regardless of which alternative
hypothesis one employs, the data do not contradict the company's statement that the population
standard deviation is σ = 5.
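
The analysis of Example 3.6 can also be carried out directly from the raw scores, as in the
Python sketch below. It is offered only as an illustration: the variable names are hypothetical,
the estimated population variance is obtained with the standard library's statistics.variance
(which divides by n - 1), and the two-tailed critical values are typed in from the table as
quoted above.

```python
from statistics import variance

coughs = [9, 10, 8, 4, 8, 3, 0, 10, 15, 9]  # data from Example 3.6
sigma2_null = 5 ** 2                         # H0: sigma = 5, so sigma^2 = 25

n = len(coughs)
s2_est = variance(coughs)                    # unbiased estimate (divisor n - 1)
chi2 = (n - 1) * s2_est / sigma2_null        # Equation 3.2

# Tabled critical two-tailed .05 values for df = 9, as quoted in the text.
lower_crit, upper_crit = 2.70, 19.02
reject_two_tailed = chi2 <= lower_crit or chi2 >= upper_crit

print(f"estimated variance = {s2_est:.2f}, chi-square = {chi2:.2f}")
print(f"reject H0 (two-tailed, .05 level): {reject_two_tailed}")
```

Because the sketch does not round the standard deviation to 4.25 before squaring, its
estimated variance can differ slightly from 18.06 in the second decimal place.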


Endnotes


1. Most sources note that violation of the normality assumption is much more serious for the 
single-sample chi-square test for a population variance than it is for tests concerning the 
mean of a single sample (i.e., the single-sample z test and the single-sample t test).
Especially in the case of small sample sizes, violation of the normality assumption can 
severely compromise the accuracy of the tabled values employed to evaluate the chi-square 
statistic. 


2. One can also state the null and alternative hypotheses in reference to the population standard
deviation (which is the square root of the population variance). Since in Example 3.1
σ = √σ² = √5 = 2.24, one can state the null hypothesis and nondirectional and directional
alternative hypotheses as follows: H₀: σ = 2.24; H₁: σ ≠ 2.24; H₁: σ > 2.24; and
H₁: σ < 2.24.


3. The use of the chi-square distribution in evaluating the variance is based on the fact that for
any value of n, the sampling distribution of s̃² has a direct linear relationship to the
chi-square distribution for df = n - 1. As is the case for the chi-square distribution, the sampling
distribution of s̃² is positively skewed. Although the average of the sampling distribution
for s̃² will equal σ², because of the positive skew of the distribution, a value of s̃² is more
likely to underestimate rather than overestimate the value of σ².


4. When the chi-square distribution is employed within the framework of the single-sample chi- 
square test for a population variance, it is common practice to employ critical values 
derived from both tails of the distribution. However, when the chi-square distribution is used 
with other statistical tests, one generally employs critical values that are only derived from 
the right tail of the distribution. Examples of chi-square tests which generally only employ 
the right tail of the distribution are the chi-square goodness-of-fit test (Test 8) and the chi- 
square test for r x c tables (Test 16). 


5. Although the procedure described in this section for computing a confidence interval for a 
population variance is the one that is most commonly described in statistics books, it does not 
result in the shortest possible confidence interval that can be computed. Hogg and Tanis 
(1988) describe a method (based on Crisman (1975)) requiring more advanced mathematical 
procedures that allows one to compute the shortest possible confidence interval for a 
population variance. For large sample sizes the difference between the latter method and the 
method described in this section will be trivial.


Test 4


The Single-Sample Test for Evaluating 
Population Skewness 
(Parametric Test Employed with Interval/Ratio Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Does a sample of n subjects (or objects) come from a popula- 
tion distribution that is symmetrical (i.e., not skewed)? 


Relevant background information on test Prior to reading this section the reader should 
review the discussion of skewness in the Introduction. As noted in the Introduction, skewness 
is a measure reflecting the degree to which a distribution is asymmetrical. From a statistical 
perspective, the skewness of a distribution represents the third moment about the mean (m₃),
which is represented by Equation 4.1 (which is identical to Equation I.14).


\[ m_3 = \frac{\sum (X - \bar{X})^3}{n} \]   (Equation 4.1)


Skewness can be employed as a criterion for determining the goodness-of-fit of data with 
respect to a normal distribution. Various sources (e.g., D'Agostino (1970, 1986), D'Agostino 
and Stephens (1986), D'Agostino et al. (1990)) state that in spite of the fact that it is not as 
commonly employed as certain alternative goodness-of-fit tests, the single-sample test for 
evaluating population skewness provides an excellent test for evaluating a hypothesis of 
goodness-of-fit for normality, when it is employed in conjunction with the result of the single- 
sample test for evaluating population kurtosis (Test 5). The results of the latter two tests are 
employed in the D'Agostino-Pearson test of normality (Test 5a), which is described in Section 
VI of the single-sample test for evaluating population kurtosis. D'Agostino (1986),
D'Agostino et al. (1990) and Zar (1999) state that the D'Agostino-Pearson test of normality
provides for a more powerful test of the normality hypothesis than does either the
Kolmogorov-Smirnov goodness-of-fit test for a single sample (Test 7) or the chi-square goodness-of-fit
test (Test 8) (both of which are described later in the book). D'Agostino et al. (1990) state that
because of their lack of power, the latter two tests should not be employed for assessing 
normality. Other sources, however, take a more favorable attitude towards the 
Kolmogorov-Smirnov goodness-of-fit test for a single sample and the chi-square goodness- 
of-fit test as tests of goodness-of-fit for normality (e.g., Conover (1980, 1999), Daniel (1990), 
Hollander and Wolfe (1999), Marascuilo and McSweeney (1977), Siegel and Castellan (1988), 
and Sprent (1993)). 

In the Introduction it was noted that since the value computed for m₃ is in cubed units, it
is often converted into the unitless statistic g₁. The latter, which is an estimate of the population
parameter γ₁ (where γ represents the lower case Greek letter gamma), is commonly employed
to express skewness. When a distribution is symmetrical (about the mean), the value of g₁ will
equal 0. When the value of g₁ is significantly above 0, a distribution will be positively skewed,
and when it is significantly below 0, a distribution will be negatively skewed. Although the
normal distribution is symmetrical (with g₁ = 0), as noted earlier, not all symmetrical
distributions are normal. Examples of nonnormal distributions that are symmetrical are the t distribution
and the binomial distribution, when π₁ = .5 (the meaning of the notation π₁ = .5 is explained
in Section I of the binomial sign test for a single sample (Test 9)).

It was also noted in the Introduction that some sources (e.g., D'Agostino (1970, 1986) and
D'Agostino et al. (1990)) convert the value of g₁ into the statistic √b₁. The latter is an estimate
of a population parameter designated √β₁ (where β represents the lower case Greek letter beta),
which is also employed to represent skewness. When a distribution is symmetrical (such as in
the case of a normal distribution), the value of √b₁ will equal 0. When the value of √b₁ is
significantly above 0, a distribution will be positively skewed, and when it is significantly below
0, a distribution will be negatively skewed.

The single-sample test for evaluating population skewness is the procedure for
determining whether a g₁ and/or √b₁ value deviates significantly from 0. The normal distribution
is employed to provide an approximation of the exact sampling distribution for the statistics g₁
and √b₁. Thus, the test statistic computed for the single-sample test for evaluating population
skewness is a z value.


II. Example 


Example 4.1 A researcher wishes to evaluate the data in three samples (comprised of 10 scores 
per sample) for skewness. Specifically, the researcher wants to determine whether or not the 
samples meet the criteria for symmetry as opposed to positive versus negative skewness. The 
three samples will be designated Sample A, Sample B, and Sample C. The researcher has 
reason to believe that Sample A is derived from a symmetrical population distribution, Sample 
B from a negatively skewed population distribution, and Sample C from a positively skewed 
population distribution. The data for the three distributions are presented below. 


Distribution A: 0, 0, 0, 5, 5, 5, 5, 10, 10, 10 
Distribution B: 0, 1, 1, 9, 9, 10, 10, 10, 10, 10 
Distribution C: 0, 0, 0, 0, 0, 1, 1, 9, 9, 10


Are the data consistent with what the researcher believes to be true regarding the
underlying population distributions?


III. Null versus Alternative Hypotheses 


Null hypothesis                    H₀: γ₁ = 0 or H₀: √β₁ = 0


(The underlying population distribution the sample represents is symmetrical, in which case
the population parameters γ₁ and √β₁ are equal to 0.)


Alternative hypothesis             H₁: γ₁ ≠ 0 or H₁: √β₁ ≠ 0


(The underlying population distribution the sample represents is not symmetrical, in which
case the population parameters γ₁ and √β₁ are not equal to 0. This is a nondirectional
alternative hypothesis, and it is evaluated with a two-tailed test. In order to be supported, the
absolute value of z must be equal to or greater than the tabled critical two-tailed z value at the
prespecified level of significance. Thus, either a significant positive z value or a significant
negative z value will provide support for this alternative hypothesis.)


or


                                   H₁: γ₁ > 0 or H₁: √β₁ > 0


(The underlying population distribution the sample represents is positively skewed, in which
case the population parameters γ₁ and √β₁ are greater than 0. This is a directional alternative
hypothesis, and it is evaluated with a one-tailed test. It will only be supported if the sign of z
is positive, and the absolute value of z is equal to or greater than the tabled critical one-tailed z
value at the prespecified level of significance.)


or


                                   H₁: γ₁ < 0 or H₁: √β₁ < 0


(The underlying population distribution the sample represents is negatively skewed, in which
case the population parameters γ₁ and √β₁ are less than 0. This is a directional alternative
hypothesis, and it is evaluated with a one-tailed test. It will only be supported if the sign of z
is negative, and the absolute value of z is equal to or greater than the tabled critical one-tailed z
value at the prespecified level of significance.)


Note: Only one of the above noted alternative hypotheses is employed. If the alternative
hypothesis the researcher selects is supported, the null hypothesis is rejected.


IV. Test Computations 


The three distributions presented in Example 4.1 are identical to Distributions A, B, and C
employed in the Introduction to demonstrate the computation of the values m₃, g₁, and √b₁.
Employing Equations I.17/I.18, I.19, and I.20, the following values were previously computed
for Distributions A, B, and C: for Distribution A, m₃ = 0, g₁ = 0, and √b₁ = 0; for
Distribution B, m₃ = -86.67, g₁ = -1.02, and √b₁ = -.86; and for Distribution C, m₃ = 86.67,
g₁ = 1.02, and √b₁ = .86 (the computation of the latter values is summarized in
Tables I.2-I.4).
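
The values quoted above can be checked with the Python sketch below. Equations I.17-I.20
are not reproduced in this section, so the formulas in the sketch are assumptions: it uses the
commonly employed unbiased estimate of the third moment, the ratio of that estimate to the
cube of the estimated population standard deviation for g₁, and the usual conversion from g₁
to √b₁. These formulas do reproduce the values listed for Distributions A, B, and C. The
function name skewness_statistics is hypothetical.

```python
import math


def skewness_statistics(scores):
    """Return (m3, g1, sqrt_b1) for a sample; the formulas are assumed, not quoted."""
    n = len(scores)
    mean = sum(scores) / n
    sum_sq = sum((x - mean) ** 2 for x in scores)
    sum_cu = sum((x - mean) ** 3 for x in scores)
    m3 = n * sum_cu / ((n - 1) * (n - 2))        # unbiased third-moment estimate
    s_est = math.sqrt(sum_sq / (n - 1))          # estimated population standard deviation
    g1 = m3 / s_est ** 3                         # unitless skewness statistic
    sqrt_b1 = (n - 2) * g1 / math.sqrt(n * (n - 1))
    return m3, g1, sqrt_b1


distributions = {
    "A": [0, 0, 0, 5, 5, 5, 5, 10, 10, 10],
    "B": [0, 1, 1, 9, 9, 10, 10, 10, 10, 10],
    "C": [0, 0, 0, 0, 0, 1, 1, 9, 9, 10],
}

for label, scores in distributions.items():
    m3, g1, sqrt_b1 = skewness_statistics(scores)
    print(f"Distribution {label}: m3 = {m3:.2f}, g1 = {g1:.2f}, sqrt(b1) = {sqrt_b1:.2f}")
```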

Equations 4.2-4.8 (which are presented in Zar (1999, pp. 115-116)) summarize the steps
that are involved in computing the test statistic (which, as is noted above, is a z value) for the
single-sample test for evaluating population skewness. Zar (1999) states that Equation 4.8
provides a good approximation of the exact probabilities for the sampling distribution of g₁
(which is employed to compute the value of √b₁ that is used in Equation 4.2), when n ≥ 9.
Note that in Equations 4.6 and 4.8, the notation ln represents the natural logarithm of a number
(which is defined in Endnote 5 in the Introduction).


\[ A = \sqrt{b_1}\,\sqrt{\frac{(n + 1)(n + 3)}{6(n - 2)}} \]   (Equation 4.2)


\[ B = \frac{3(n^2 + 27n - 70)(n + 1)(n + 3)}{(n - 2)(n + 5)(n + 7)(n + 9)} \]   (Equation 4.3)


\[ C = \sqrt{2(B - 1)} - 1 \]   (Equation 4.4)


\[ D = \sqrt{C} \]   (Equation 4.5)


\[ E = \frac{1}{\sqrt{\ln D}} \]   (Equation 4.6)


\[ F = \frac{A}{\sqrt{\dfrac{2}{C - 1}}} \]   (Equation 4.7)


\[ z = E \ln\left(F + \sqrt{F^2 + 1}\right) \]   (Equation 4.8)


Employing Equations 4.2-4.8, the values z_A = 0, z_B = -1.53, and z_C = 1.53 are
computed for Distributions A, B, and C.


Distribution A 





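\[ A = 0\,\sqrt{\frac{(10 + 1)(10 + 3)}{6(10 - 2)}} = 0 \]


\[ B = \frac{3[10^2 + 27(10) - 70](10 + 1)(10 + 3)}{(10 - 2)(10 + 5)(10 + 7)(10 + 9)} = 3.32 \]


\[ C = \sqrt{2(3.32 - 1)} - 1 = 1.15 \]

\[ D = \sqrt{1.15} = 1.07 \]


\[ E = \frac{1}{\sqrt{\ln 1.07}} = 3.84 \]


\[ F = \frac{0}{\sqrt{\dfrac{2}{1.15 - 1}}} = 0 \]


\[ z_A = 3.84 \ln\left[0 + \sqrt{(0)^2 + 1}\right] = 0 \]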


Distribution B 





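\[ A = -.86\,\sqrt{\frac{(10 + 1)(10 + 3)}{6(10 - 2)}} = -1.48 \]


\[ B = \frac{3[10^2 + 27(10) - 70](10 + 1)(10 + 3)}{(10 - 2)(10 + 5)(10 + 7)(10 + 9)} = 3.32 \]


\[ C = \sqrt{2(3.32 - 1)} - 1 = 1.15 \]

\[ D = \sqrt{1.15} = 1.07 \]


\[ E = \frac{1}{\sqrt{\ln 1.07}} = 3.84 \]


\[ F = \frac{-1.48}{\sqrt{\dfrac{2}{1.15 - 1}}} = -.41 \]


\[ z_B = 3.84 \ln\left[-.41 + \sqrt{(-.41)^2 + 1}\right] = -1.53 \]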


Distribution C 


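\[ A = .86\,\sqrt{\frac{(10 + 1)(10 + 3)}{6(10 - 2)}} = 1.48 \]


\[ B = \frac{3[10^2 + 27(10) - 70](10 + 1)(10 + 3)}{(10 - 2)(10 + 5)(10 + 7)(10 + 9)} = 3.32 \]


\[ C = \sqrt{2(3.32 - 1)} - 1 = 1.15 \]

\[ D = \sqrt{1.15} = 1.07 \]


\[ E = \frac{1}{\sqrt{\ln 1.07}} = 3.84 \]


\[ F = \frac{1.48}{\sqrt{\dfrac{2}{1.15 - 1}}} = .41 \]


\[ z_C = 3.84 \ln\left[.41 + \sqrt{(.41)^2 + 1}\right] = 1.53 \]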
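
The sequence of steps in Equations 4.2-4.8 can be sketched in Python as follows. This is an
illustration under stated assumptions rather than the author's procedure: the function names
are hypothetical, and √b₁ is computed as the third moment about the mean divided by the
second moment raised to the 3/2 power (both with divisor n), which yields the same values of
√b₁ as those quoted for Distributions A, B, and C. Because the sketch does not round
intermediate results (in particular, it does not round D to 1.07 before taking the logarithm),
it yields approximately -1.50 and 1.50 for Distributions B and C rather than the values -1.53
and 1.53 obtained by hand. The conclusions regarding the null hypothesis are unaffected.

```python
import math


def sqrt_b1(scores):
    """Skewness statistic sqrt(b1) = m3 / m2**1.5, using moments with divisor n."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5


def skewness_z(sqrt_b1_value, n):
    """Equations 4.2-4.8: normal approximation for the skewness statistic."""
    a = sqrt_b1_value * math.sqrt((n + 1) * (n + 3) / (6 * (n - 2)))   # Equation 4.2
    b = (3 * (n ** 2 + 27 * n - 70) * (n + 1) * (n + 3)
         / ((n - 2) * (n + 5) * (n + 7) * (n + 9)))                    # Equation 4.3
    c = math.sqrt(2 * (b - 1)) - 1                                     # Equation 4.4
    d = math.sqrt(c)                                                   # Equation 4.5
    e = 1 / math.sqrt(math.log(d))                                     # Equation 4.6
    f = a / math.sqrt(2 / (c - 1))                                     # Equation 4.7
    return e * math.log(f + math.sqrt(f ** 2 + 1))                     # Equation 4.8


distributions = {
    "A": [0, 0, 0, 5, 5, 5, 5, 10, 10, 10],
    "B": [0, 1, 1, 9, 9, 10, 10, 10, 10, 10],
    "C": [0, 0, 0, 0, 0, 1, 1, 9, 9, 10],
}

z_values = {label: skewness_z(sqrt_b1(scores), len(scores))
            for label, scores in distributions.items()}

for label, z in z_values.items():
    print(f"Distribution {label}: z = {z:.2f}")
```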


V. Interpretation of the Test Results 


The obtained values z_A = 0, z_B = -1.53, and z_C = 1.53 are evaluated with Table A1 (Table
of the Normal Distribution) in the Appendix. In Table A1 the tabled critical two-tailed .05 and
.01 values are z.05 = 1.96 and z.01 = 2.58, and the tabled critical one-tailed .05 and .01 values
are z.05 = 1.65 and z.01 = 2.33. Since the computed absolute values z_A = 0, z_B = 1.53, and
z_C = 1.53 are all less than the tabled critical two-tailed value z.05 = 1.96 and the tabled critical
one-tailed value z.05 = 1.65, the null hypothesis cannot be rejected, regardless of which
alternative hypothesis is employed.
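
An equivalent way of evaluating the obtained z values is to convert them into probabilities
and compare the latter with the prespecified level of significance. The Python sketch below
does this with the standard library's NormalDist in place of Table A1. It is an illustration
only, and the function name z_p_values is hypothetical.

```python
from statistics import NormalDist


def z_p_values(z):
    """Return (one-tailed, two-tailed) probabilities for |z| under the standard normal curve."""
    one_tailed = 1 - NormalDist().cdf(abs(z))
    return one_tailed, 2 * one_tailed


# z values computed by hand for Distributions A, B, and C.
for label, z in (("A", 0.0), ("B", -1.53), ("C", 1.53)):
    one_tailed, two_tailed = z_p_values(z)
    print(f"Distribution {label}: one-tailed p = {one_tailed:.4f}, "
          f"two-tailed p = {two_tailed:.4f}")
```

For z = 1.53 the one-tailed probability is approximately .063, which exceeds .05 and is
therefore consistent with the conclusion reached with the tabled critical values. For a
directional alternative hypothesis, the sign of z must also be in the predicted direction.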

The computation of the value z_A = 0 for Distribution A is consistent with the fact that the
latter distribution is employed to represent a symmetrical distribution. Thus, the nondirectional
alternative hypothesis H₁: γ₁ ≠ 0 (or H₁: √β₁ ≠ 0) is not supported. Whenever a distribution
has perfect symmetry, g₁ (as well as √b₁) will equal 0, and consequently the value computed for
z will also equal 0.

Although not statistically significant, the data for Distribution B are consistent with the
directional alternative hypothesis H₁: γ₁ < 0 (or H₁: √β₁ < 0). Similarly, although not
statistically significant, the data for Distribution C are consistent with the directional alternative
hypothesis H₁: γ₁ > 0 (or H₁: √β₁ > 0). Note that because the values g₁ = -1.02 and
√b₁ = -.86 computed for Distribution B are negative numbers, a negative z value is obtained
for Distribution B (which is hypothesized to represent a negatively skewed distribution). In the
same respect, since the values g₁ = 1.02 and √b₁ = .86 computed for Distribution C are positive
numbers, a positive z value is obtained for Distribution C (which is hypothesized to represent
a positively skewed distribution). Whenever a distribution is
negatively skewed, the computed value of z will be a negative number, and whenever a
distribution is positively skewed, the computed value of z will be a positive number. The fact
that the values z_B = -1.53 and z_C = 1.53 are not statistically significant (although they are not
that far removed from the tabled critical one-tailed .05 value z.05 = 1.65), in large part may be
attributed to the fact that a small sample size (i.e., n = 10) is employed to represent each
distribution. A small sample size severely reduces the power of a statistical test, making it more
difficult to obtain a statistically significant result (i.e., in this case, a significant deviation from
symmetry).

It should be noted that in most instances when a researcher has reason to evaluate a
distribution with regard to skewness, he will employ a sample size which is much larger than the
value n = 10 employed in Example 4.1. Section VII discusses tables that document the exact
sampling distribution for the g₁ statistic, and contrasts the results obtained with the latter tables
with the results obtained in this section.


VI. Additional Analytical Procedures for the Single-Sample Test for 
Evaluating Population Skewness 


1. Note on the D'Agostino-Pearson test of normality (Test 5a) Most researchers would not
consider the result of the single-sample test for evaluating population skewness, in and of
itself, as sufficient evidence for establishing goodness-of-fit for normality. As noted in Section
I, a procedure is presented in Section VI of the single-sample test for evaluating population
kurtosis, which employs the z value based on the computed value of g₁ (which is employed to
compute the value of √b₁ that is used in Equation 4.2) and a z value based on the computed value
of g₂ (which is a measure of kurtosis that is discussed in the Introduction and in the
single-sample test for evaluating population kurtosis) to evaluate whether or not a set of data is
derived from a normal distribution. The latter procedure is referred to as the
D'Agostino-Pearson test of normality.


VII. Additional Discussion of the Single-Sample Test for Evaluating 
Population Skewness 


1. Exact tables for the single-sample test for evaluating population skewness Zar (1999) 
has derived exact tables for the absolute value of the g₁ statistic for sample sizes in the range 
9 ≤ n ≤ 1000. By employing the exact tables, one can avoid the tedious computations that are 
described in Section IV for the single-sample test for evaluating population skewness (which 
employs the normal distribution to approximate the exact sampling distribution). In Zar's (1999) 
tables, the tabled critical two-tailed .05 and .01 values for g₁ are 1.359 and 1.846, 
and the tabled critical one-tailed .05 and .01 values are 1.125 and 1.643. In order 
to reject the null hypothesis, the computed absolute value of g₁ must be equal to or greater than 
the tabled critical value (and if a directional alternative hypothesis is evaluated, the sign of g₁ 
must be in the predicted direction). The probabilities derived for Example 4.1 (through use of 
Equation 4.8) are extremely close to the exact probabilities listed in Zar (1999). The probability 
for Distribution A is identical to Zar's (1999) exact probability. With respect to Distributions 
B and C, the probabilities listed in Zar's (1999) tables for the values g₁ = -1.02 and 
g₁ = 1.02 are very close to the tabled probabilities in Table A1 for the computed values 
z = -1.53 and z = 1.53. Note that the absolute value g₁ = 1.02 just falls short of being 
significant at the .05 level if a one-tailed analysis is conducted. In the case of Example 4.1, the 
same conclusions regarding the null hypothesis will be reached, regardless of whether one 
employs the normal approximation or Zar's (1999) tables. 


2. Note on a nonparametric test for evaluating skewness Zar (1999, pp. 119—120) describes 
a nonparametric procedure for evaluating skewness/symmetry around the median of a distri- 
bution (as opposed to the mean). The latter test is based on the Wilcoxon signed-ranks test 
(Test 6), which is one of the nonparametric procedures described in this book. 


VIII. Additional Examples Illustrating the Use of the Single-Sample 
Test for Evaluating Population Skewness 


No additional examples will be presented in this section. 


Test 5 


The Single-Sample Test for Evaluating 
Population Kurtosis 
(Parametric Test Employed with Interval/Ratio Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Does a sample of n subjects (or objects) come from a popu- 
lation distribution that is mesokurtic? 


Relevant background information on test Prior to reading this section the reader should 
review the discussion of kurtosis in the Introduction. As noted in the Introduction, kurtosis 
is a measure reflecting the degree of peakedness of a distribution. From a statistical perspective, 
the kurtosis of a distribution represents the fourth moment about the mean (m₄), which is 
represented by Equation 5.1 (which is identical to Equation I.15). 


\[ m_4 = \frac{\sum (X - \bar{X})^4}{n} \qquad \text{(Equation 5.1)} \]


Kurtosis can be employed as a criterion for determining the goodness-of-fit of data with 
respect to a normal distribution. Various sources (e.g., Anscombe and Glynn (1983), D'Agostino 
(1986), D'Agostino and Stephens (1986), and D'Agostino et al. (1990)) state that in spite of the 
fact that it is not as commonly employed as certain alternative goodness-of-fit tests, the single-sample 
test for evaluating population kurtosis provides an excellent test for evaluating a hypothesis 
of goodness-of-fit for normality, when it is employed in conjunction with the result of 
the single-sample test for evaluating population skewness (Test 4). The results of the latter 
two tests are employed in the D'Agostino-Pearson test of normality (Test 5a), which is 
described in Section VI. D'Agostino (1986), D'Agostino et al. (1990), and Zar (1999) state that 
the D'Agostino-Pearson test of normality provides for a more powerful test of the normality 
hypothesis than does either the Kolmogorov-Smirnov goodness-of-fit test for a single sample 
(Test 7) or the chi-square goodness-of-fit test (Test 8) (both of which are described later in the 
book). D'Agostino et al. (1990) state that because of their lack of power, the latter two tests 
should not be employed for assessing normality. Other sources, however, take a more favorable 
attitude towards the Kolmogorov-Smirnov goodness-of-fit test for a single sample and the chi-square 
goodness-of-fit test as tests of goodness-of-fit for normality (e.g., Conover (1980, 1999), 
Daniel (1990), Hollander and Wolfe (1999), Marascuilo and McSweeney (1977), Siegel and 
Castellan (1988), and Sprent (1993)). It should be noted that the single-sample test for evaluating 
population kurtosis is a test of mesokurtic normality; in other words, it evaluates whether or not a 
distribution is mesokurtic. 

In the Introduction it was noted that since the value computed for m₄ is in units of the 
fourth power, it is often converted into the unitless statistic g₂. The latter, which is an estimate 
of the population parameter γ₂ (where γ represents the lower case Greek letter gamma), is 
commonly employed to express kurtosis. When a distribution is mesokurtic, the value of g₂ will 
equal 0. When the value of g₂ is significantly above 0, a distribution will be leptokurtic, and 
when it is significantly below 0, a distribution will be platykurtic. 

It was also noted in the Introduction that some sources (e.g., Anscombe and Glynn (1983), 
D'Agostino (1986), and D'Agostino et al. (1990)) convert the value of g₂ into the statistic b₂. 
The latter is an estimate of a population parameter designated β₂ (where β represents the lower 
case Greek letter beta), which is also employed to represent kurtosis. When a distribution is 
mesokurtic, the value of b₂ will equal [3(n - 1)]/(n + 1). Inspection of the latter equation reveals 
that as the value of the sample size increases, the value of b₂ approaches 3. When the value 
computed for b₂ is significantly below [3(n - 1)]/(n + 1), a distribution will be platykurtic. When 
the value computed for b₂ is significantly greater than [3(n - 1)]/(n + 1), a distribution will 
be leptokurtic. 
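
The convergence of the expected value of b₂ toward 3 can be checked numerically. The following Python sketch evaluates [3(n - 1)]/(n + 1) for several sample sizes. The function name expected_b2 is illustrative only and does not appear in any of the cited sources.

```python
def expected_b2(n):
    """Expected value of b2 for a mesokurtic distribution: 3(n - 1)/(n + 1)."""
    return 3 * (n - 1) / (n + 1)


# As n increases, the expected value moves toward (but stays below) 3.
for n in (10, 20, 100, 1000):
    print(f"n = {n:4d}: expected b2 = {expected_b2(n):.4f}")
```

For the sample size n = 20 employed in Example 5.1, the expression equals 57/21, or approximately 2.714, which is noticeably below the limiting value of 3.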

The single-sample test for evaluating population kurtosis is the procedure for determining 
whether a g₂ and/or b₂ value deviates significantly from the expected value for a mesokurtic 
distribution. As noted earlier, any normal distribution will always be mesokurtic, with the 
following expected computed values: g₂ = 0 and b₂ = 3. 

The normal distribution is employed to provide an approximation of the exact sampling 
distribution for the statistics g₂ and b₂. Thus, the test statistic computed for the single-sample 
test for evaluating population kurtosis is a z value. 


II. Example 


Example 5.1 A researcher wishes to evaluate the data in two samples (comprised of 20 scores 
per sample) for kurtosis. The two samples will be designated Sample E and Sample F. The 
researcher has reason to believe that Sample E is derived from a leptokurtic population 
distribution, while Sample F is derived from a platykurtic population distribution. The data for 
the two distributions are presented below. 


Distribution E: 2, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 12, 12, 12, 13, 18 
Distribution F: 0, 1, 3, 3, 5, 5, 8, 8, 10, 10, 10, 10, 12, 12, 15, 15, 17, 17, 19, 20 


Are the data consistent with what the researcher believes to be true regarding the 
underlying population distributions? 


III. Null versus Alternative Hypotheses 


Null hypothesis H₀: γ₂ = 0 or H₀: β₂ = 3 

(The underlying population distribution the sample represents is mesokurtic, in which case the 
population parameter γ₂ is equal to 0, and the population parameter β₂ is equal to 3.) 


Alternative hypothesis H₁: γ₂ ≠ 0 or H₁: β₂ ≠ 3 


(The underlying population distribution the sample represents is not mesokurtic, in which case 
the population parameter γ₂ is not equal to 0, and the population parameter β₂ is not equal to 3. 
This is a nondirectional alternative hypothesis, and it is evaluated with a two-tailed test. In 
order to be supported, the absolute value of z must be equal to or greater than the tabled critical 
two-tailed z value at the prespecified level of significance.) 


or 


H₁: γ₂ > 0 or H₁: β₂ > 3 

(The underlying population distribution the sample represents is leptokurtic, in which case the 
population parameter γ₂ is greater than 0, and the population parameter β₂ is greater than 3. 
This is a directional alternative hypothesis, and it is evaluated with a one-tailed test. It will 
only be supported if g₂ > 0, and the absolute value of z is equal to or greater than the tabled 
critical one-tailed z value at the prespecified level of significance.) 


or 


H₁: γ₂ < 0 or H₁: β₂ < 3 

(The underlying population distribution the sample represents is platykurtic, in which case the 
population parameter γ₂ is less than 0, and the population parameter β₂ is less than 3. This is a 
directional alternative hypothesis, and it is evaluated with a one-tailed test. It will only be 
supported if g₂ < 0, and the absolute value of z is equal to or greater than the tabled critical 
one-tailed z value at the prespecified level of significance.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 


The two distributions presented in Example 5.1 are identical to Distributions E and F employed 
in the Introduction to demonstrate the computation of the values m₄, g₂, and b₂. Employing 
Equations I.21/I.22, I.23, and I.24, the following values were previously computed for m₄, 
g₂, and b₂ for Distributions E and F: m₄ = 307.170 (E) and m₄ = -1181.963 (F); g₂ = 3.596 (E) 
and g₂ = -.939 (F); and b₂ = 5.472 (E) and b₂ = 1.994 (F) (the computation of the latter values is 
summarized in Tables I.5-I.6). 
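
The values listed above can be approximated with a short Python sketch. Equations I.21-I.24 are presented in the Introduction rather than in this section, so the formulas below are assumptions: the usual unbiased estimate of the fourth moment about the mean, g₂ = m₄/s⁴, and a linear conversion of g₂ into b₂. The function name kurtosis_stats is hypothetical. Under these assumptions the sketch reproduces the m₄ values reported for both distributions.

```python
def kurtosis_stats(scores):
    """Return (m4, g2, b2) for a list of scores.

    Assumed formulas (not shown in this section): the unbiased estimate of the
    fourth moment about the mean, g2 = m4 / s^4, and the linear conversion of
    g2 into b2.
    """
    n = len(scores)
    mean = sum(scores) / n
    ss2 = sum((x - mean) ** 2 for x in scores)  # sum of squared deviations
    ss4 = sum((x - mean) ** 4 for x in scores)  # sum of fourth-power deviations
    m4 = (n * (n + 1) * ss4 - 3 * (n - 1) * ss2 ** 2) / ((n - 1) * (n - 2) * (n - 3))
    s2 = ss2 / (n - 1)                          # unbiased sample variance
    g2 = m4 / s2 ** 2
    b2 = (n - 2) * (n - 3) * g2 / ((n + 1) * (n - 1)) + 3 * (n - 1) / (n + 1)
    return m4, g2, b2


dist_e = [2, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 12, 12, 12, 13, 18]
dist_f = [0, 1, 3, 3, 5, 5, 8, 8, 10, 10, 10, 10, 12, 12, 15, 15, 17, 17, 19, 20]

for label, data in (("E", dist_e), ("F", dist_f)):
    m4, g2, b2 = kurtosis_stats(data)
    print(f"Distribution {label}: m4 = {m4:.3f}, g2 = {g2:.3f}, b2 = {b2:.3f}")
```

At full precision the sketch yields values of roughly 3.58 for g₂ and 5.46 for b₂ in the case of Distribution E, slightly below the reported 3.596 and 5.472. The discrepancy appears to reflect rounding of the standard deviation to two decimal places before it is raised to the fourth power. The values for Distribution F agree with those reported to three decimal places.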

Equations 5.2-5.7 (which are presented in Zar (1999, pp. 116-118)) summarize the steps that 
are involved in computing the test statistic (which, as is noted above, is a z value) for the single-sample 
test for evaluating population kurtosis. Zar (1999) states that Equation 5.7 provides 
a good approximation of the exact probabilities for the sampling distribution of g₂, when n ≥ 20. 


\[ G = \frac{24n(n - 2)(n - 3)}{(n + 1)^2 (n + 3)(n + 5)} \qquad \text{(Equation 5.2)} \]

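\[ H = \frac{(n - 2)(n - 3)\,|g_2|}{(n + 1)(n - 1)\sqrt{G}} \qquad \text{(Equation 5.3)} \]

\[ J = \frac{6(n^2 - 5n + 2)}{(n + 7)(n + 9)} \sqrt{\frac{6(n + 3)(n + 5)}{n(n - 2)(n - 3)}} \qquad \text{(Equation 5.4)} \]

\[ K = 6 + \frac{8}{J}\left[\frac{2}{J} + \sqrt{1 + \frac{4}{J^2}}\right] \qquad \text{(Equation 5.5)} \]

\[ L = \frac{1 - 2/K}{1 + H\sqrt{2/(K - 4)}} \qquad \text{(Equation 5.6)} \]

\[ z = \frac{1 - \dfrac{2}{9K} - \sqrt[3]{L}}{\sqrt{2/(9K)}} \qquad \text{(Equation 5.7)} \]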


Employing Equations 5.2-5.7, the values z = 2.40 and z = 1.07 are computed for 
Distributions E and F, respectively. 
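Distribution E 


\[ G = \frac{(24)(20)(20 - 2)(20 - 3)}{(20 + 1)^2 (20 + 3)(20 + 5)} = .579 \]

\[ H = \frac{(20 - 2)(20 - 3)\,|3.596|}{(20 + 1)(20 - 1)\sqrt{.579}} = 3.624 \]

\[ J = \frac{6[20^2 - 5(20) + 2]}{(20 + 7)(20 + 9)} \sqrt{\frac{6(20 + 3)(20 + 5)}{20(20 - 2)(20 - 3)}} = 1.738 \]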
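
The computation of the z value can be sketched in Python as follows. The quantities G, H, and J follow the worked values for Distribution E. The remaining intermediate quantities (named K and L here) and the final expression for z follow the commonly published form of this normal approximation, and should be treated as assumptions to be checked against Zar (1999, pp. 116-118). The function name kurtosis_z is hypothetical.

```python
import math


def kurtosis_z(g2, n):
    """Normal approximation for the sampling distribution of g2 (assumed form).

    The absolute value of g2 is employed, so the returned z value is positive
    for the data of Example 5.1; the sign of g2 indicates the direction of
    the departure from mesokurtosis.
    """
    G = 24 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5))
    H = (n - 2) * (n - 3) * abs(g2) / ((n + 1) * (n - 1) * math.sqrt(G))
    J = (6 * (n ** 2 - 5 * n + 2) / ((n + 7) * (n + 9))) * math.sqrt(
        6 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3))
    )
    K = 6 + (8 / J) * (2 / J + math.sqrt(1 + 4 / J ** 2))
    L = (1 - 2 / K) / (1 + H * math.sqrt(2 / (K - 4)))
    return (1 - 2 / (9 * K) - L ** (1 / 3)) / math.sqrt(2 / (9 * K))


print(f"Distribution E: z = {kurtosis_z(3.596, 20):.2f}")
print(f"Distribution F: z = {kurtosis_z(-0.939, 20):.2f}")
```

When the reported values g₂ = 3.596 and g₂ = -.939 are employed with n = 20, the sketch returns values that round to the z values 2.40 and 1.07 reported for Distributions E and F.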


V. Interpretation of the Test Results 


The obtained values z = 2.40 (Distribution E) and z = 1.07 (Distribution F) are evaluated with 
Table A1 (Table of the Normal Distribution) in the Appendix. In Table A1 the tabled critical 
two-tailed .05 and .01 values are z.05 = 1.96 and z.01 = 2.58, and the tabled critical one-tailed 
.05 and .01 values are z.05 = 1.65 and z.01 = 2.33. 

With respect to Distribution E, since the obtained value z = 2.40 is greater than the 
tabled critical two-tailed value z.05 = 1.96, the nondirectional alternative hypothesis H₁: γ₂ ≠ 0 
(or H₁: β₂ ≠ 3) is supported at the .05 level. However, it is not supported at the .01 level, since 
z = 2.40 is less than the tabled critical two-tailed value z.01 = 2.58. The directional alternative 
hypothesis H₁: γ₂ > 0 (or H₁: β₂ > 3) is supported at both the .05 and .01 levels, since the 
obtained value g₂ = 3.596 is a positive number, and the obtained value z = 2.40 is greater 
than the tabled critical one-tailed values z.05 = 1.65 and z.01 = 2.33. 

The directional alternative hypothesis H₁: γ₂ < 0 (or H₁: β₂ < 3) is not supported, since in 
order for the latter alternative hypothesis to be supported, the computed value of g₂ must be a 
negative number. 

With respect to Distribution E, the result of the single-sample test for evaluating 
population kurtosis allows one to conclude that there is a high likelihood the sample is derived 
from a population which is leptokurtic. 

In the case of Distribution F, since the obtained value z = 1.07 is less than the tabled 
critical two-tailed value z.05 = 1.96 and the tabled critical one-tailed value z.05 = 1.65, the null 
hypothesis cannot be rejected, regardless of which alternative hypothesis is employed. Although 
the value g₂ = -.939 computed for Distribution F is a negative number, its absolute value is 
not large enough to warrant the conclusion that the sample is derived from a population that is 
platykurtic. 
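
The comparisons with Table A1 can also be expressed as tail probabilities of the standard normal distribution. The sketch below employs NormalDist from the Python standard library. The helper name tail_probabilities is hypothetical, and the one-tailed probability is only meaningful when the sign of g₂ is in the predicted direction.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def tail_probabilities(z):
    """Return (one_tailed, two_tailed) probabilities for the absolute value of z."""
    upper = 1 - STANDARD_NORMAL.cdf(abs(z))
    return upper, 2 * upper


for label, z in (("E", 2.40), ("F", 1.07)):
    one_tailed, two_tailed = tail_probabilities(z)
    print(f"Distribution {label}: one-tailed p = {one_tailed:.4f}, "
          f"two-tailed p = {two_tailed:.4f}")

# Critical z values for the tail areas employed in the text. The value for a
# one-tailed area of .05 is about 1.645, which the text lists as 1.65.
for tail_area in (0.025, 0.005, 0.05, 0.01):
    print(tail_area, round(STANDARD_NORMAL.inv_cdf(1 - tail_area), 3))
```

For Distribution E the two-tailed probability falls between .01 and .05 and the one-tailed probability falls below .01, which is consistent with the conclusions stated above. For Distribution F both probabilities exceed .05.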

It should be noted that in most instances when a researcher has reason to evaluate a 
distribution with regard to kurtosis, he will employ a sample size which is much larger than the 
value n = 20 employed in Example 5.1. Section VII discusses tables that document the exact 
sampling distribution for the g₂ statistic, and contrasts the results obtained with the latter tables 
with the results obtained in this section. 


VI. Additional Analytical Procedures for the Single-Sample Test for 
Evaluating Population Kurtosis 


1. Test 5a: The D'Agostino-Pearson test of normality Many researchers would not consider 
the result of the single-sample test for evaluating population kurtosis, in and of itself, as 
sufficient evidence for establishing goodness-of-fit for normality. In this section a procedure (to 
be referred to as the D'Agostino-Pearson test of normality) will be described which employs 
the computed value of g₂ and the computed value of g₁ (which is a measure of skewness discussed 
in the Introduction and in the single-sample test for evaluating population skewness) 
to evaluate whether a set of data is derived from a normal distribution. 

The D'Agostino-Pearson test of normality was developed by D'Agostino and Pearson 
(1973), who state that the test is the most effective procedure for assessing goodness-of-fit for 
a normal distribution. D'Agostino (1973, 1986) and Zar (1999) claim that the D'Agostino- 
Pearson test of normality is more effective for assessing goodness-of-fit than the more 
commonly employed Kolmogorov-Smirnov goodness-of-fit test for a single sample and the 
chi-square goodness-of-fit test. 

The null and alternative hypotheses that are evaluated with the D'Agostino-Pearson test 
of normality are as follows. 


Null hypothesis H₀: The sample is derived from a normally distributed population. 


Alternative hypothesis H₁: The sample is not derived from a normally distributed population. 
This is a nondirectional alternative hypothesis. 


The test statistic for the D'Agostino-Pearson test of normality is computed with Equation 
5.8. 


\[ \chi^2 = z_{g_1}^2 + z_{g_2}^2 \qquad \text{(Equation 5.8)} \]


The two terms on the right side of Equation 5.8 are, respectively, the squares of the z values 
computed with Equation 4.8 and Equation 5.7. Equation 4.8 is employed to compute the test 
statistic z (which is based on the value computed for √b₁) for the single-sample test for 
evaluating population skewness, and Equation 5.7 is employed to compute the test statistic z 
for the single-sample test for evaluating population kurtosis. The computed value of χ² is 
evaluated with Table A4 (Table of the Chi-Square Distribution) in the Appendix. The degrees 
of freedom employed in the analysis will always be df = 2. The tabled critical .05 and .01 
chi-square values in Table A4 for df = 2 are χ².05 = 5.99 and χ².01 = 9.21.¹ If the computed 
value of chi-square is equal to or greater than either of the aforementioned values, the null 
hypothesis can be rejected at the appropriate level of significance. If the null hypothesis is 
rejected, a researcher can conclude that a set of data does not fit a normal distribution. Through 
examination of the results of the single-sample test for evaluating population skewness and 
the single-sample test for evaluating population kurtosis, the researcher can determine if 
rejection of the null hypothesis is the result of a lack of symmetry and/or a departure from 
mesokurtosis. 

It happens to be the case that Distribution F is a symmetrical distribution, and thus g₁ = 0 
and √b₁ = 0. Consequently, the value computed with Equation 4.7 will be 0. When the 
latter value, along with the value z = 1.07 computed for Distribution F with Equation 5.7, is 
substituted in Equation 5.8, the value χ² = 1.14 is obtained. 


\[ \chi^2 = (0)^2 + (1.07)^2 = 1.14 \]


Since the value χ² = 1.14 is less than the tabled critical value χ².05 = 5.99, we are not able 
to reject the null hypothesis. Thus, we cannot conclude that the population distribution for 
Distribution F is nonnormal. 

It happens to be the case that Distribution E is also a symmetrical distribution, and thus 
g₁ = 0 and √b₁ = 0. Consequently, the value computed with Equation 4.7 will be 0. 
When the latter value, along with the value z = 2.40 computed for Distribution E with 
Equation 5.7, is substituted in Equation 5.8, the value χ² = 5.76 is obtained (the latter value 
resulting entirely from the square of the significant value z = 2.40, which indicated that 
Distribution E is leptokurtic). 


\[ \chi^2 = (0)^2 + (2.40)^2 = 5.76 \]


Since the value χ² = 5.76 is less (although not by much) than the tabled critical value 
χ².05 = 5.99, we are not able to reject the null hypothesis. However, the analysis of the data in 
Section IV with the single-sample test for evaluating population kurtosis suggests that 
Distribution E is clearly leptokurtic, and that in itself might be sufficient grounds for some 
researchers to conclude that it is unlikely that the underlying population is normal. Thus, in spite 
of the nonsignificant result with the D'Agostino-Pearson test of normality, some researchers 
might view it prudent to reject the null hypothesis purely on the basis of the outcome of the 
single-sample test for evaluating population kurtosis. 
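
The D'Agostino-Pearson computations for Distributions E and F can be sketched in Python as follows. The sketch relies on the fact that the upper-tail probability of a chi-square distribution with df = 2 equals exp(-χ²/2), so no table lookup is required. The function name dagostino_pearson is hypothetical, and the skewness z values are taken to be 0 because both distributions are symmetrical.

```python
import math


def dagostino_pearson(z_skewness, z_kurtosis):
    """Return (chi_square, p_value) for the sum of two squared z values, df = 2."""
    chi_square = z_skewness ** 2 + z_kurtosis ** 2
    p_value = math.exp(-chi_square / 2)  # upper-tail area of chi-square, df = 2
    return chi_square, p_value


for label, z_kurtosis in (("E", 2.40), ("F", 1.07)):
    chi_square, p_value = dagostino_pearson(0.0, z_kurtosis)
    print(f"Distribution {label}: chi-square = {chi_square:.2f}, p = {p_value:.4f}")

# Critical chi-square values for df = 2 follow from the same relationship.
for alpha in (0.05, 0.01):
    print(alpha, round(-2 * math.log(alpha), 2))
```

The critical values produced by the last loop are 5.99 and 9.21. For Distribution E the probability is only slightly greater than .05, which corresponds to the observation that χ² = 5.76 falls just short of the tabled critical .05 value.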


VII. Additional Discussion of the Single-Sample Test for Evaluating 
Population Kurtosis 


1. Exact tables for the single-sample test for evaluating population kurtosis Zar (1999) has 
derived exact tables for the absolute value of the g₂ statistic for sample sizes in the range 
20 ≤ n ≤ 1000. By employing the exact tables, one can avoid the tedious computations that are 
described in Section IV for the single-sample test for evaluating population kurtosis (which 
employs the normal distribution to approximate the exact sampling distribution). In Zar's (1999) 
tables, the tabled critical two-tailed .05 and .01 values for g₂ are 2.486 and 4.121, 
and the tabled critical one-tailed .05 and .01 values are 1.850 and 3.385. In order 
to reject the null hypothesis, the computed absolute value of g₂ must be equal to or greater than 
the tabled critical value (and if a directional alternative hypothesis is evaluated, the sign of g₂ 
must be in the predicted direction). The probabilities derived for Example 5.1 (through use of 
Equation 5.7) are very close to the exact probabilities listed in Zar (1999). In other words, the 
probabilities listed in Zar's (1999) tables for the absolute values g₂ = 3.596 and g₂ = .939 are 
almost the same as the tabled probabilities in Table A1 for the computed values z = 2.40 and 
z = 1.07. In the case of Example 5.1, the same conclusions regarding the null hypothesis 
will be reached, regardless of whether one employs the normal approximation or Zar's (1999) 
tables. 


VIII. Additional Examples Illustrating the Use of the Single-Sample 
Test for Evaluating Population Kurtosis 


No additional examples will be presented in this section. 


Endnotes 


1. When Table A4 is employed to evaluate a chi-square value computed for the D'Agostino-Pearson 
test of normality, the following protocol is employed. The tabled critical values for 
df = 2 are derived from the right tail of the distribution. Thus, the tabled critical .05 chi-square 
value (to be designated χ².05) will be the tabled chi-square value at the 95th percentile. 
In the same respect, the tabled critical .01 chi-square value (to be designated χ².01) will be the 
tabled chi-square value at the 99th percentile. For further clarification of interpretation of the 
critical values in Table A4, the reader should consult Section V of the single-sample chi-square 
test for a population variance (Test 3). 


Test 6 


The Wilcoxon Signed-Ranks Test 
(Nonparametric Test Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Does a sample of n subjects (or objects) come from a population 
in which the median value (θ) equals a specified value? 


Relevant background information on test The Wilcoxon signed-ranks test (Wilcoxon 
(1945, 1949)) is a nonparametric procedure employed in a hypothesis testing situation involving 
a single sample in order to determine whether a sample is derived from a population with a 
median of θ. (The population median will be represented by the notation θ, which is the lower 
case Greek letter theta.) If the Wilcoxon signed-ranks test yields a significant result, the 
researcher can conclude there is a high likelihood the sample is derived from a population with 
a median value other than θ. 

The Wilcoxon signed-ranks test is based on the following assumptions: a) The sample 
has been randomly selected from the population it represents; b) The original scores obtained for 
each of the subjects/objects are in the format of interval/ratio data; and c) The underlying 
population distribution is symmetrical. When there is reason to believe that the latter assumption 
is violated, Daniel (1990), among others, recommends that the binomial sign test for a single 
sample (Test 9) be employed in place of the Wilcoxon signed-ranks test. Proponents of nonparametric 
tests recommend that the Wilcoxon signed-ranks test be employed in place of the 
single-sample t test (Test 2) when there is reason to believe that the normality assumption of the 
latter test has been saliently violated. It should be noted that all of the other tests in this text that 
rank data (with the exception of the Wilcoxon matched-pairs signed-ranks test (Test 18) and 
the Moses test for equal variability (Test 15)) rank the original interval/ratio scores of subjects. 
The Wilcoxon signed-ranks test, however, does not rank subjects' original interval/ratio scores, 
but instead ranks difference scores: specifically, the obtained difference between each subject's 
score and the hypothesized value of the population median. For this reason, some sources categorize 
the Wilcoxon signed-ranks test as a test of interval/ratio data. Most sources, however, 
(including this book) categorize the test as one involving ordinal data, because a ranking procedure 
is part of the test protocol. 


II. Example 


Example 6.1 A physician states that the median number of times he sees each of his patients 
during the year is five. In order to evaluate the validity of this statement, he randomly selects ten 
of his patients and determines the number of office visits each of them made during the past year. 
He obtains the following values for the ten patients in his sample: 9, 10, 8, 4, 8, 3, 0, 10, 15, 
9. Do the data support his contention that the median number of times he sees a patient is five? 


III. Null versus Alternative Hypotheses 


Null hypothesis H₀: θ = 5 


(The median of the population the sample represents equals 5. With respect to the sample data, 
this translates into the sum of the ranks of the positive difference scores being equal to the sum 
of the ranks of the negative difference scores (i.e., ΣR+ = ΣR-).) 


Alternative hypothesis H₁: θ ≠ 5 


(The median of the population the sample represents does not equal 5. With respect to the 
sample data, this translates into the sum of the ranks of the positive difference scores not being 
equal to the sum of the ranks of the negative difference scores (i.e., ΣR+ ≠ ΣR-). This is a 
nondirectional alternative hypothesis and it is evaluated with a two-tailed test.) 


or 


H₁: θ > 5 


(The median of the population the sample represents is some value greater than 5. With respect 
to the sample data, this translates into the sum of the ranks of the positive difference scores being 
greater than the sum of the ranks of the negative difference scores (i.e., ΣR+ > ΣR-). This is a 
directional alternative hypothesis and it is evaluated with a one-tailed test.) 


or 


H₁: θ < 5 


(The median of the population the sample represents is some value less than 5. With respect to 
the sample data, this translates into the sum of the ranks of the positive difference scores being 
less than the sum of the ranks of the negative difference scores (i.e., ΣR+ < ΣR-). This is a 
directional alternative hypothesis and it is evaluated with a one-tailed test.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 


The data for Example 6.1 are summarized in Table 6.1. 

The scores of the 10 subjects are recorded in Column 2 of Table 6.1. In Column 3, a D 
score is computed for each subject. This score, which is referred to as a difference score, is the 
difference between a subject's score and the hypothesized value of the population median, θ = 5. 
Column 4 contains the ranks of the difference scores. In ranking the difference scores for the 
Wilcoxon signed-ranks test, the following guidelines are employed: 

a) The absolute values of the difference scores (|D|) are ranked (i.e., the sign of a dif- 
ference score is not taken into account). 

b) Any difference score that equals zero is not ranked. This translates into eliminating from 
the analysis any subject who yields a difference score of zero. 

c) In ranking the absolute values of the difference scores, the following protocol should be 
employed: Assign a rank of 1 to the difference score with the lowest absolute value, a rank of 2 
to the difference score with the second lowest absolute value, and so on until the highest rank is 
assigned to the difference score with the highest absolute value. When there are tied scores 
present, the average of the ranks involved is assigned to all difference scores tied for a given rank. 


Table 6.1 Data for Example 6.1 


Subject     X     D = X - θ     Rank of |D|     Signed rank of |D| 
   1        9         4             5.5               5.5 
   2       10         5             8                 8 
   3        8         3             3.5               3.5 
   4        4        -1             1                -1 
   5        8         3             3.5               3.5 
   6        3        -2             2                -2 
   7        0        -5             8                -8 
   8       10         5             8                 8 
   9       15        10            10                10 
  10        9         4             5.5               5.5 
                                                 ΣR+ = 44 
                                                 ΣR- = 11 


Because of this latter fact, when there are tied scores for either the lowest or highest difference 
scores, the rank assigned to the lowest difference score will be some value greater than 1, and 
the rank assigned to the highest difference score will be some value less than n. To further 
clarify how ties are handled, examine Table 6.2 which lists the difference scores of the 10 
subjects. In the table, the difference scores (based on their absolute values) are arranged 
ordinally, after which they are ranked employing the protocol described above. 


Table 6.2 Ranking Procedure for Wilcoxon Signed-Ranks Test 


Subject number                         4     6     3     5     1    10     2     7     8     9 
Subject's difference score            -1    -2     3     3     4     4     5    -5     5    10 
Absolute value of difference score     1     2     3     3     4     4     5     5     5    10 
Rank of |D|                            1     2    3.5   3.5   5.5   5.5    8     8     8    10 


The difference score of Subject 4 has the lowest absolute value (i.e., 1), and because of this 
it is assigned a rank of 1. The next lowest absolute value for a difference score (2) is that of 
Subject 6, and thus it is assigned a rank of 2. The difference score of 3 (which is obtained for both 
Subjects 3 and 5) is the score that corresponds to the third rank-order. Since, however, there are 
two instances of this difference score, it will also use up the position reserved for the fourth 
rank-order (i.e., 3 and 4 are the two ranks that would be employed if, in fact, these two subjects did 
not have the identical difference score). Instead of arbitrarily assigning one of the subjects with 
a difference score of 3 a rank-order of 3 and the other subject a rank-order of 4, we compute the 
average of the two ranks that are involved (i.e., (3 + 4)/2 = 3.5), and assign that value as the 
rank-order for the difference scores of both subjects. The next rank-order in the sequence of the 10 
rank-orders is 5. Once again, however, two subjects (Subjects 1 and 10) are tied for the difference 
score in the fifth ordinal position (which happens to involve a difference score of 4). Since, 
if not equal to one another, these two difference scores would involve the fifth and sixth ranks, 
we compute the average of these two ranks (i.e., (5 + 6)/2 = 5.5), and assign that value as the rank 
for the difference scores of Subjects 1 and 10. With respect to the next difference score (5), there 
is a three-way tie involving Subjects 2, 7, and 8 (keeping in mind that the absolute value of the 
difference score for Subject 7 is 5). The average of the three ranks which would be involved if 
the subjects had obtained different difference scores is computed (i.e., (7 + 8 + 9)/3 = 8), and that 
average value is assigned to the difference scores of Subjects 2, 7, and 8. Since the remaining 
difference score of 10 (obtained by Subject 9) is the highest difference score, it is assigned the 
highest rank, which equals 10. 

It should be emphasized that in the Wilcoxon signed-ranks test it is essential that a rank 
of 1 be assigned to the difference score with the lowest absolute value, and that the highest rank 
be assigned to the difference score with the highest absolute value. In most other tests that in- 
volve ranking, the ranking procedure can be reversed (i.e., the same test statistic will be obtained 
if one assigns a rank of 1 to the highest score and the highest rank to the lowest score). However, 
if one reverses the ranking procedure in conducting the Wilcoxon signed-ranks test, it will 
invalidate the results of the test. 

d) After ranking the absolute values of the difference scores, the sign of each difference 
score is placed in front of its rank. The signed ranks of the difference scores are listed in Column 
5 of Table 6.1. 

The sum of the ranks that have a positive sign (i.e., ΣR+ = 44) and the sum of the ranks that 
have a negative sign (i.e., ΣR- = 11) are recorded at the bottom of Column 5 in Table 6.1. 
Equation 6.1 allows one to check the accuracy of these values. If the relationship indicated by 
Equation 6.1 is not obtained, it indicates an error has been made in the calculations. In Equation 
6.1, n represents the number of signed ranks (i.e., the number of difference scores that are 
ranked). 


\[ \Sigma R{+} \; + \; \Sigma R{-} = \frac{n(n + 1)}{2} \qquad \text{(Equation 6.1)} \]


Employing the values ΣR+ = 44 and ΣR- = 11 in Equation 6.1, we confirm that the 
relationship described by the equation is true. 


\[ 44 + 11 = \frac{(10)(11)}{2} = 55 \]


It is important to note that in the event one or more subjects obtains a difference score of 
zero, such scores are not employed in the analysis. In such a case, the value of n in Equation 6.1 
will only represent the number of scores that have been assigned ranks. Example 6.2 in Section 
VIII illustrates the use of the Wilcoxon signed-ranks test with data in which difference scores 
of zero are present. 
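
The ranking protocol described in this section can be sketched in Python as follows. The function name signed_rank_sums is hypothetical. The sketch drops difference scores of zero, ranks the absolute values of the remaining difference scores from lowest to highest, assigns the average rank to tied scores, and employs Equation 6.1 as a check on the two sums.

```python
def signed_rank_sums(scores, hypothesized_median):
    """Return (sum of positive ranks, sum of negative ranks, number of signed ranks)."""
    # Guideline b): difference scores of zero are not ranked.
    diffs = [x - hypothesized_median for x in scores if x != hypothesized_median]
    # Guidelines a) and c): rank on absolute value, lowest absolute value first.
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to the last difference score tied with position i.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2  # mean of the ranks involved in the tie
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    # Guideline d): attach the sign of each difference score to its rank.
    sum_positive = sum(r for d, r in zip(ordered, ranks) if d > 0)
    sum_negative = sum(r for d, r in zip(ordered, ranks) if d < 0)
    n = len(ordered)
    # Equation 6.1: the two sums must total n(n + 1)/2.
    assert sum_positive + sum_negative == n * (n + 1) / 2
    return sum_positive, sum_negative, n


visits = [9, 10, 8, 4, 8, 3, 0, 10, 15, 9]
print(signed_rank_sums(visits, 5))
```

For the data of Example 6.1 the sketch returns the sums 44 and 11 for n = 10 signed ranks, in agreement with Table 6.1.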


V. Interpretation of the Test Results 


As noted in Section III, if the sample is derived from a population with a median value equal to 
the hypothesized value of the population median (i.e., the null hypothesis is true), the values of 
XR-« and XR- will be equal to one another. When /R+ and XR- are equivalent, both of these 
values will equal [n(n + 1)]/4, which in the case of Example 6.1 will be [(10)(11)]/4 = 27.5. This 
latter value is commonly referred to as the expected value of the Wilcoxon T statistic. 

If the value of ΣR+ is significantly greater than the value of ΣR-, it indicates there is a high
likelihood the sample is derived from a population with a median value which is larger than the
hypothesized value of the population median. On the other hand, if ΣR- is significantly greater
than ΣR+, it indicates there is a high likelihood the sample is derived from a population with a
median value that is less than the hypothesized value of the population median. The fact that
ΣR+ = 44 is greater than ΣR- = 11 indicates that the data are consistent with the directional
alternative hypothesis H₁: θ > 5. The question is, however, whether the difference is
significant, i.e., whether it is large enough to conclude that it is unlikely to be the result of
chance.

The absolute value of the smaller of the two values ΣR+ versus ΣR- is designated as the
Wilcoxon T test statistic. Since ΣR- = 11 is smaller than ΣR+ = 44, T = 11. The T value is
interpreted by employing Table A5 (Table of Critical T Values for Wilcoxon's Signed-Ranks
and Matched-Pairs Signed-Ranks Tests) in the Appendix. Table A5 lists the critical two-tailed
and one-tailed .05 and .01 T values in relation to the number of signed ranks in a set of
data. In order to be significant, the obtained value of T must be equal to or less than the tabled
critical T value at the prespecified level of significance.⁵ Table 6.3 summarizes the tabled critical
two-tailed and one-tailed .05 and .01 Wilcoxon T values for n = 10 signed ranks.


Table 6.3 Tabled Critical Wilcoxon T Values for n = 10 Signed Ranks 


                     T.05   T.01
Two-tailed values      8      3
One-tailed values     10      5


Since the null hypothesis can only be rejected if the computed value T = 11 is equal to or 
less than the tabled critical value at the prespecified level of significance, we can conclude the 
following. 

In order for the nondirectional alternative hypothesis H₁: θ ≠ 5 to be supported, it is
irrelevant whether ΣR+ > ΣR- or ΣR- > ΣR+. In order for the result to be significant, the
computed value of T must be equal to or less than the tabled critical two-tailed value at the
prespecified level of significance. Since the computed value T = 11 is greater than the tabled
critical two-tailed .05 value T.05 = 8, the nondirectional alternative hypothesis H₁: θ ≠ 5 is not
supported at the .05 level. It is also not supported at the .01 level, since T = 11 is greater than
the tabled critical two-tailed .01 value T.01 = 3.

In order for the directional alternative hypothesis H₁: θ > 5 to be supported, ΣR+ must
be greater than ΣR-. Since ΣR+ > ΣR-, the data are consistent with the directional alternative
hypothesis H₁: θ > 5. In order for the result to be significant, the computed value of T must
be equal to or less than the tabled critical one-tailed value at the prespecified level of
significance. Since the computed value T = 11 is greater than the tabled critical one-tailed .05
value T.05 = 10, the directional alternative hypothesis is not supported at the .05 level. It is also
not supported at the .01 level, since T = 11 is greater than the tabled critical one-tailed .01 value
T.01 = 5.

In order for the directional alternative hypothesis H₁: θ < 5 to be supported, the
following two conditions must be met: a) ΣR- must be greater than ΣR+; and b) the computed
value of T must be equal to or less than the tabled critical one-tailed value at the prespecified
level of significance. Since the first of these conditions is not met, the directional alternative
hypothesis H₁: θ < 5 is not supported.
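
The decision rules just described can be summarized in a short Python sketch. The critical values are those listed in Table 6.3 for n = 10 signed ranks; the function name `evaluate` and the labels used for the alternative hypotheses are hypothetical and serve only to illustrate the logic.

```python
# Critical T values for n = 10 signed ranks, copied from Table 6.3.
CRITICAL_T_N10 = {
    ("two-tailed", 0.05): 8,
    ("two-tailed", 0.01): 3,
    ("one-tailed", 0.05): 10,
    ("one-tailed", 0.01): 5,
}


def evaluate(sum_pos, sum_neg, alternative, alpha):
    """Return True if the alternative hypothesis is supported at the given alpha.

    alternative: "two-sided" (theta != 5), "greater" (theta > 5), or "less" (theta < 5).
    """
    T = min(sum_pos, sum_neg)  # Wilcoxon T is the smaller of the two sums
    if alternative == "two-sided":
        return T <= CRITICAL_T_N10[("two-tailed", alpha)]
    # A directional hypothesis also requires the sums to fall in the predicted order.
    if alternative == "greater":
        consistent = sum_pos > sum_neg
    else:
        consistent = sum_neg > sum_pos
    return consistent and T <= CRITICAL_T_N10[("one-tailed", alpha)]


for alt in ("two-sided", "greater", "less"):
    for alpha in (0.05, 0.01):
        print(alt, alpha, evaluate(44, 11, alt, alpha))
```

With ΣR+ = 44 and ΣR- = 11, every combination evaluates to False, which agrees with the conclusions stated above.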

A summary of the analysis of Example 6.1 with the Wilcoxon signed-ranks test follows: 
With respect to the median number of times the doctor sees a patient, we can conclude that the 
data do not indicate that the sample of 10 subjects comes from a population with a median value 
other than 5. 

Except for the fact that the mean rather than the median is employed as the population
parameter stated in the null and alternative hypotheses, Example 2.1 is identical to Example 6.1
(i.e., the two examples employ the same set of data). Since Example 2.1 states the null hypothesis
with reference to the population mean, it is evaluated with the single-sample t test. At this point
we will compare the results of the two tests. When the same data are evaluated with the
single-sample t test, the null hypothesis can be rejected when the directional alternative hypothesis
H₁: μ > 5 is employed, but only at the .05 level. With reference to the latter alternative
hypothesis, the obtained t value exceeds the tabled critical t.05 value by a comfortable margin. When
the single-sample t test is employed, the nondirectional alternative hypothesis H₁: μ ≠ 5 is not
supported at the .05 level.

When Example 6.1 is evaluated with the Wilcoxon signed-ranks test, the null hypothesis 
cannot be rejected regardless of which alternative hypothesis is employed. However, when the
directional alternative hypothesis H₁: θ > 5 is employed, the Wilcoxon signed-ranks test falls
just short of being significant at the .05 level. Directly related to this is the fact that in some
sources the tabled critical values published for the Wilcoxon test statistic are not identical to the
values listed in Table A5. These differences are the result of the rounding protocol employed. The critical
T values in Table A5 listed for a given level of significance are associated with the probability 
that is closest to but not greater than the value of alpha. In some instances a T value listed in an 
alternative table may be one point higher than the value listed in Table A5, thus making it easier 
to reject the null hypothesis. Although these alternative critical values are actually closer to the 
value of alpha than the values listed in Table A5, the probability associated with a tabled critical 
T value in the alternative table is, in fact, larger than the value of alpha. With reference to 
Example 6.1, the exact probability associated with T = 10 is .0420 (i.e., this represents the
likelihood of obtaining a T value of 10 or less). The probability associated with T = 11, which
is the critical value of T listed in the alternative table, is .0527. Although the latter probability
is closer to α = .05 than is .0420, it falls above .05. Thus, if one employs the alternative table
that contains the tabled critical one-tailed .05 value T.05 = 11, the alternative hypothesis
H₁: θ > 5 is supported at the .05 level. Obviously in a case such as this where the likelihood
of obtaining a value equal to or less than the computed value of T is just slightly above .05, it
would seem prudent to conduct further studies in order to clarify the status of the alternative
hypothesis H₁: θ > 5.
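
The exact probabilities quoted above can be reproduced by enumerating the null distribution of the sum of ranks for one sign. The following Python sketch is offered under two assumptions: the n ranks are the untied integers 1 through n (which is the case that published tables cover), and under the null hypothesis each rank is equally likely to carry a positive or a negative sign. The function name `wilcoxon_lower_tail` is hypothetical.

```python
def wilcoxon_lower_tail(n, t):
    """One-tailed P(sum of ranks for one sign <= t) under H0, for n untied ranks."""
    max_sum = n * (n + 1) // 2
    # counts[s] = number of subsets of the ranks {1, ..., n} whose sum equals s.
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # Iterate downward so that each rank is used at most once per subset.
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    # Each of the 2**n possible sign assignments is equally likely under H0.
    return sum(counts[: t + 1]) / 2 ** n


print(round(wilcoxon_lower_tail(10, 10), 4))  # probability of a sum of 10 or less
print(round(wilcoxon_lower_tail(10, 11), 4))  # probability of a sum of 11 or less
```

For n = 10 the two probabilities are 43/1024 and 54/1024, which correspond to the values .0420 and .0527 cited in the text and show why the two tables disagree by one point at the one-tailed .05 level.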

In the case of Examples 6.1 and 2.1, the results of the Wilcoxon signed-ranks and
single-sample t test are fairly consistent for the same set of data. Support for the analogous alternative
hypotheses H₁: μ > 5 and H₁: θ > 5 is either clearly indicated (in the case of the t test) or
falls just short of significance (in the case of the Wilcoxon test). The slight discrepancy between
the two tests reflects the fact that, as a general rule, nonparametric tests are not as powerful as
their parametric analogs. In the case of the two tests under consideration, the lower power of the
Wilcoxon signed-ranks test can be attributed to the loss of information which results from
expressing interval/ratio data in a rank-order format (specifically, rank-ordering the difference
scores). As noted earlier, when two or more inferential statistical tests are applied to the same
set of data and yield contradictory results, it is prudent to replicate the study. In the final 
analysis, replication is the most powerful tool a researcher has at his disposal for determining the 
status of a null hypothesis. 


VI. Additional Analytical Procedures for the Wilcoxon Signed- 
Ranks Test and/or Related Tests 


1. The normal approximation of the Wilcoxon T statistic for large sample sizes If the 
sample size employed in a study is relatively large, the normal distribution can be used to 
approximate the Wilcoxon T statistic. Although sources do not agree on the value of the sample 
size that justifies employing the normal approximation of the Wilcoxon distribution, they gen- 
erally state it should be used for sample sizes larger than those documented in the Wilcoxon table 
contained within the source. Equation 6.2 provides the normal approximation for Wilcoxon T. 
In the equation T represents the computed value of Wilcoxon T, which for Example 6.1 is T= 11. 
n, as noted previously, represents the number of signed ranks. Thus, in our example, n = 10. 
Note that in the numerator of Equation 6.2, the term [n(n + 1)]/4 represents the expected value
of T (often summarized with the symbol T_E), which is defined in Section V. The denominator
of Equation 6.2 represents the expected standard deviation of the sampling distribution of the T
statistic.








$$z = \frac{T - \frac{n(n + 1)}{4}}{\sqrt{\frac{n(n + 1)(2n + 1)}{24}}} \qquad \text{(Equation 6.2)}$$

Although Example 6.1 involves only ten signed ranks (a value most sources would view
as too small to use with the normal approximation), it will be employed to illustrate Equation 6.2. 
The reader will see that in spite of employing Equation 6.2 with a small sample size, it will yield 
essentially the same result as that obtained when the exact table of the Wilcoxon distribution is 
employed. When the values T = 11 and n = 10 are substituted in Equation 6.2, the value 
z = -1.68 is computed.
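
Written out, the substitution takes the following form. It assumes the usual normal approximation, in which the expected value of T is [n(n + 1)]/4 and the standard deviation is the square root of [n(n + 1)(2n + 1)]/24; the intermediate value 9.81 is computed here from n = 10 rather than quoted from the text.

$$z = \frac{11 - \frac{(10)(11)}{4}}{\sqrt{\frac{(10)(11)(21)}{24}}} = \frac{11 - 27.5}{9.81} = -1.68$$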

The obtained value z = -1.68 is evaluated with Table A1 (Table of the Normal
Distribution) in the Appendix. In Table A1 the tabled critical two-tailed .05 and .01 values
are z.05 = 1.96 and z.01 = 2.58, and the tabled critical one-tailed .05 and .01 values are
z.05 = 1.65 and z.01 = 2.33.

Since the smaller of the two values ΣR+ versus ΣR- is selected to represent T, the value
of z computed with Equation 6.2 will always be a negative number (unless ΣR+ = ΣR-, in which
case z will equal zero). This is the case since, by selecting the smaller value, T will always be
less than the expected value T_E. As a result of this, the following guidelines are employed
when evaluating the null hypothesis.

a) If a nondirectional alternative hypothesis is employed, the null hypothesis can be rejected
if the obtained absolute value of z is equal to or greater than the tabled critical two-tailed value 
at the prespecified level of significance. 

b) When a directional alternative hypothesis is employed, one of the two possible 
directional alternative hypotheses will be supported if the obtained absolute value of z is equal 
to or greater than the tabled critical one-tailed value at the prespecified level of significance. 
Which alternative hypothesis is supported depends on the prediction regarding which of the two 
values ΣR+ versus ΣR- is larger. The null hypothesis can only be rejected if the directional
alternative hypothesis that is consistent with the data is supported. 

Employing the above guidelines, when the normal approximation is used with Example 6.1, 
the following conclusions can be reached. 

The nondirectional alternative hypothesis H₁: θ ≠ 5 is not supported. This is the case
since the computed absolute value z = 1.68 is less than the tabled critical two-tailed .05 value
z.05 = 1.96. This decision is consistent with the decision that is reached when the exact table
of the Wilcoxon distribution is employed to evaluate the nondirectional alternative hypothesis
H₁: θ ≠ 5.

The directional alternative hypothesis H₁: θ > 5 is supported at the .05 level. This is the
case since the data are consistent with the latter alternative hypothesis (i.e., ΣR+ > ΣR-), and the
computed absolute value z = 1.68 is greater than the tabled critical one-tailed .05 value
z.05 = 1.65. The directional alternative hypothesis H₁: θ > 5 is not supported at the .01 level,
since the absolute value z = 1.68 is less than the tabled critical one-tailed .01 value z.01 = 2.33.
When the exact table of the Wilcoxon distribution is employed, the directional alternative
hypothesis H₁: θ > 5 is not supported at the .05 level. However, it was noted that if an alternative
table of Wilcoxon critical values is employed, the alternative hypothesis H₁: θ > 5 is
supported at the .05 level.

The directional alternative hypothesis H₁: θ < 5 is not supported, since the data are not
consistent with the latter alternative hypothesis (which requires that ΣR- > ΣR+).

In closing the discussion of the normal approximation, it should be noted that, in actuality, 
either ΣR+ or ΣR- can be employed to represent the value of T in Equation 6.2. Either value
will yield the same absolute value for z. The smaller of the two values will always yield a
negative z value, and the larger of the two values will always yield a positive z value (which in
this instance will be z = 1.68 if ΣR+ = 44 is employed in Equation 6.2 to represent T). In
evaluating a nondirectional alternative hypothesis, the sign of z is irrelevant. In the case of a 
directional alternative hypothesis, one must determine whether the data are consistent with the 
alternative hypothesis that is stipulated. If the data are consistent, one then determines whether 
or not the absolute value of z is equal to or greater than the tabled critical one-tailed value at the 
prespecified level of significance. 


2. The correction for continuity for the normal approximation of the Wilcoxon 
signed-ranks test Although not described in most sources, Marascuilo and McSweeney (1977) 
employ a correction factor known as the correction for continuity for the normal approximation 
of the Wilcoxon test statistic. The correction for continuity is recommended by some sources 
for use with a number of nonparametric tests that employ a continuous distribution (such as the 
normal distribution) to estimate a discrete distribution (such as in this instance the Wilcoxon 
distribution). As noted in the Introduction, in a continuous distribution there are an infinite 
number of values a variable may assume, whereas in a discrete distribution the number of
possible values a variable may assume is limited. The correction for continuity is
based on the premise that if a continuous distribution is employed to estimate a discrete 
distribution, such an approximation will inflate the Type I error rate. By employing the 
correction for continuity, the Type I error rate is ostensibly adjusted to be more compatible with 
the prespecified alpha value designated by the researcher. When the correction for continuity 
is applied to a normal approximation of an underlying discrete distribution, it results in a slight 
reduction in the absolute value computed for z. In the case of the normal approximation of the 
Wilcoxon test statistic, the correction for continuity requires that .5 be subtracted from the 
absolute value of the numerator of Equation 6.2. Thus, Equation 6.3 represents the continuity- 
corrected normal approximation of the Wilcoxon test statistic. 
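
Based on the verbal description above (.5 is subtracted from the absolute value of the numerator of the uncorrected normal approximation), the continuity-corrected statistic can be written as follows. This is a reconstruction from that description, so the exact layout should be regarded as an assumption.

$$z = \frac{\left| T - \frac{n(n + 1)}{4} \right| - .5}{\sqrt{\frac{n(n + 1)(2n + 1)}{24}}} \qquad \text{(Equation 6.3)}$$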

If the correction for continuity is employed with Example 6.1, the value of the numerator
of Equation 6.3 is 16, in contrast to the absolute value of 16.5 computed with Equation 6.2. 
Employing Equation 6.3, the continuity-corrected value z = 1.63 is computed. Note that as a 
result of the absolute value conversion, the numerator of Equation 6.3 will always be a positive 
number, thus yielding a positive z value. 
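
For Example 6.1 the continuity-corrected computation can be written out as shown below. The denominator 9.81 is computed here from n = 10 (it is the square root of [(10)(11)(21)]/24) rather than quoted from the text.

$$z = \frac{|11 - 27.5| - .5}{\sqrt{\frac{(10)(11)(21)}{24}}} = \frac{16}{9.81} = 1.63$$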

Since the absolute value z = 1.63 is less than the tabled critical one-tailed .05 value
z.05 = 1.65, the directional hypothesis H₁: θ > 5 is not supported. Note that since the
obtained absolute value z = 1.63 is slightly below the tabled critical one-tailed value z.05 = 1.65,
it is just short of being significant (in contrast to the continuity-uncorrected absolute value z = 
1.68 computed with Equation 6.2, which barely achieves significance at the .05 level). The result 
obtained with z= 1.63 is consistent with that obtained employing the exact table of the Wilcoxon 
distribution. In a case such as this, additional research should be conducted to clarify the status 
of the null hypothesis, since the issue of whether or not to reject it depends on whether or not one 
employs the correction for continuity. 


3. Tie correction for the normal approximation of the Wilcoxon test statistic Equation 6.4 
is an adjusted version of Equation 6.2 that is recommended in some sources (e.g., Daniel (1990) 
and Marascuilo and McSweeney (1977)) when tied difference scores are present in the data. The 
tie correction results in a slight increase in the absolute value of z. Unless there are a substantial 
number of ties, the difference between the values of z computed with Equations 6.2 and 6.4 will 
be minimal. 


$$z = \frac{T - \frac{n(n + 1)}{4}}{\sqrt{\frac{n(n + 1)(2n + 1)}{24} - \frac{\Sigma t^3 - \Sigma t}{48}}} \qquad \text{(Equation 6.4)}$$

Table 6.4 illustrates the application of the tie correction with Example 6.1.


Table 6.4 Correction for Ties with Normal Approximation 


Subject   Rank    t    t³
   4        1
   6        2
   3        3.5   2     8
   5        3.5
   1        5.5   2     8
  10        5.5
   2        8     3    27
   7        8
   8        8
   9       10
               Σt = 7  Σt³ = 43


In the data for Example 6.1 there are three sets of tied ranks: Set 1 involves two subjects 
(Subjects 3 and 5); Set 2 involves two subjects (Subjects 1 and 10); Set 3 involves three subjects 
(Subjects 2, 7, and 8). The number of subjects involved in each set of tied ranks represents the 
values of t in the third column of Table 6.4. The three t values are cubed in the last column of
the table, after which the values Σt and Σt³ are computed. The appropriate values are now
substituted in Equation 6.4.⁶
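
The substitution can be written out as follows. It assumes that the tie correction subtracts (Σt³ - Σt)/48 from the term under the radical of the uncorrected normal approximation, a form that reproduces the tie-corrected absolute value z = 1.69 reported in the text; the intermediate values are computed here.

$$z = \frac{11 - \frac{(10)(11)}{4}}{\sqrt{\frac{(10)(11)(21)}{24} - \frac{43 - 7}{48}}} = \frac{-16.5}{\sqrt{96.25 - .75}} = \frac{-16.5}{9.77} = -1.69$$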

The absolute value z = 1.69 is slightly larger than the absolute value z = 1.68 obtained
without the tie correction. The difference between the two methods is trivial, and in this instance, 
regardless of which alternative hypothesis is employed, the decision the researcher makes with 
respect to the null hypothesis is not affected.⁷
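
The three versions of the normal approximation discussed in this section (uncorrected, continuity-corrected, and tie-corrected) can be collected in one Python sketch. The function name `wilcoxon_z` and its parameters are hypothetical. The tie correction is assumed to subtract the sum of (t³ - t)/48 from the variance, and the continuity correction is assumed to subtract .5 from the absolute value of the numerator, as described in the text.

```python
import math


def wilcoxon_z(T, n, tie_sizes=(), continuity=False):
    """Normal approximation for Wilcoxon T with optional tie and continuity corrections."""
    expected = n * (n + 1) / 4                    # expected value of T
    variance = n * (n + 1) * (2 * n + 1) / 24     # variance of T without ties
    # Tie correction: each set of t tied ranks reduces the variance by (t^3 - t)/48.
    variance -= sum(t ** 3 - t for t in tie_sizes) / 48
    numerator = T - expected
    if continuity:
        # Continuity correction: the result is always non-negative.
        numerator = abs(numerator) - 0.5
    return numerator / math.sqrt(variance)


# Example 6.1: T = 11, n = 10, tie sets of sizes 2, 2, and 3.
print(round(wilcoxon_z(11, 10), 2))                        # uncorrected
print(round(wilcoxon_z(44, 10), 2))                        # larger sum used as T
print(round(wilcoxon_z(11, 10, continuity=True), 2))       # continuity-corrected
print(round(wilcoxon_z(11, 10, tie_sizes=(2, 2, 3)), 2))   # tie-corrected
```

Passing the larger sum (44) rather than the smaller sum (11) changes only the sign of the uncorrected z, which illustrates the point made at the close of the discussion of the normal approximation.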

Conover (1980, 1999) and Daniel (1990) discuss and/or cite sources on the subject of 
alternative ways of handling tied difference scores. Conover (1980, 1999) also notes that in 
some instances, retaining and ranking zero difference scores may actually provide a more 
powerful test of an alternative hypothesis than the more conventional method employed in this 
book (which eliminates zero difference scores from the data). 


VII. Additional Discussion of the Wilcoxon Signed-Ranks Test 


1. Power-efficiency of the Wilcoxon signed-ranks test and the concept of asymptotic 
relative efficiency Power-efficiency (also referred to as relative efficiency) is a statistic that 
is employed to indicate the power of two tests relative to one another. It is most commonly used 
in comparing the power of a nonparametric test with its parametric analog. As an example, 
assume we wish to determine the relative power of the Wilcoxon signed-ranks test (designated 
as Test A) and the single-sample t test (designated as Test B). Assume that both tests employ
the same alpha level with respect to the null hypothesis being evaluated. 

For a fixed power value, the statistic PE_AB will represent the power-efficiency of Test A
relative to Test B. The value of PE_AB is computed with Equation 6.5.


$$PE_{AB} = \frac{n_B}{n_A} \qquad \text{(Equation 6.5)}$$


Where: n_A is the number of subjects required for Test A and n_B is the number of subjects
required for Test B, when each test is required to evaluate an alternative hypothesis
at the same power


Thus, if the single-sample t test requires 95 subjects to evaluate an alternative hypothesis
at a power of .80, and the Wilcoxon signed-ranks test requires 100 subjects to evaluate the
analogous alternative hypothesis at a power of .80, the value of PE_AB = 95/100 = .95. From this
result it can be determined that if 100 subjects are employed to evaluate a null hypothesis with
the single-sample t test, in order to achieve the same level of power for evaluating the analogous
null hypothesis with the Wilcoxon signed-ranks test, it is necessary to employ (1/.95)(100)
= 105 subjects.
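
The arithmetic of this example can be expressed as a brief Python sketch. The function name `power_efficiency` is hypothetical; it assumes, in line with the example, that power-efficiency is the ratio of the number of subjects required by Test B to the number required by Test A.

```python
def power_efficiency(n_a, n_b):
    """Power-efficiency of Test A relative to Test B at equal alpha and power."""
    return n_b / n_a


# Test A = Wilcoxon signed-ranks test (100 subjects),
# Test B = single-sample t test (95 subjects), both at a power of .80.
pe = power_efficiency(n_a=100, n_b=95)

# Subjects Test A needs to match the power of Test B run with 100 subjects.
needed = (1 / pe) * 100

print(pe)
print(round(needed))
```

The unrounded result is slightly above 105, which the text reports as 105 subjects.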

Conover (1980, 1999) notes that the value computed for relative efficiency will be a func- 
tion of the alpha and beta values a researcher employs in analyzing the data for a study. Pitman 
(1948) demonstrated that the value of relative efficiency computed for all possible choices of 
alpha and beta approaches a limiting value as the sample size approaches infinity. Pitman referred to this
limiting value as the asymptotic relative efficiency of the two tests (which is often represented
by the acronym ARE, and is also referred to as the Pitman efficiency). Since asymptotic
relative efficiency is a limiting value that is based on a large sample size, it may not be an 
accurate metric of efficiency when the sample size employed in a study is relatively small. 
However, in spite of the latter, for some nonparametric tests, the value computed for asymptotic 
relative efficiency is achieved with a relatively small sample size. The asymptotic relative 
efficiency of a test is of practical value, in that if a researcher is selecting among two or more 
nonparametric tests as an alternative to a parametric test, the nonparametric test with the highest 
asymptotic relative efficiency will allow for the most powerful test of the alternative hypothesis. 


Marascuilo and McSweeney (1977, p. 87) present a table of asymptotic relative efficiency 
values for a variety of nonparametric tests. In the latter table, asymptotic relative efficiency 
values are listed in reference to underlying population distributions with different shapes. In the 
case of the Wilcoxon signed-ranks test, its asymptotic relative efficiency is .955 (when con- 
trasted with the single-sample t test) when the underlying population distribution is normal. For
population distributions that are not normal, the asymptotic relative efficiency of the Wilcoxon 
signed-ranks test is generally equal to or greater than 1. It is interesting to note that when the 
population distribution is normal, the asymptotic relative efficiency of most nonparametric 
tests will be less than 1. However, when the underlying population is not normal, it is not 
uncommon for a nonparametric test to have an asymptotic relative efficiency greater than 1. As 
a general rule, proponents of nonparametric tests take the position that when a researcher has 
reason to believe that the normality assumption of the single-sample t test has been saliently
violated, the Wilcoxon signed-ranks test provides a powerful test of the comparable alternative 
hypothesis. 


2. Note on symmetric population concerning hypothesis regarding median and mean 
Conover (1980, 1999) and Daniel (1990) note that if, in fact, the population from which the 
sample is derived is symmetrical, the conclusions one draws with regard to the population median 
are also true with respect to the population mean (since in a symmetrical population the values 
of the mean and median will be identical). This amounts to saying that if in Example 6.1 we 
retain (or reject) H₀: θ = 5, we are also reaching the same conclusion with respect to the null
hypothesis H₀: μ = 5. There is, however, no guarantee that the results obtained with the
Wilcoxon signed-ranks test will be entirely consistent with the results derived when the
single-sample t test is employed to evaluate the same set of data.


3. Confidence interval for the median difference Conover (1980, 1999) describes a procedure 
(as well as references for alternative procedures) for computing a confidence interval for the 
median difference for a set of n difference scores. 


VIII. Additional Examples Illustrating the Wilcoxon 
Signed-Ranks Test 


With the exception of Examples 1.5 and 1.6, the Wilcoxon signed-ranks test can be employed 
to evaluate a hypothesis about a population median with any of the examples that are employed 
to illustrate the single-sample z test (Test 1) and the single-sample t test. As noted in Section
I, unless the normality assumption of the aforementioned tests is saliently violated, most re- 
searchers would employ a parametric test in lieu of a nonparametric alternative. In all instances 
in which the Wilcoxon signed-ranks test is employed, difference scores are obtained by sub- 
tracting the hypothesized value of the population median from each score in the sample. All 
difference scores are then ranked and evaluated in accordance with the ranking protocol 
described in Section IV.

Example 6.2 (which is a restatement of Example 6.1 with a different set of data) illustrates
the use of the Wilcoxon signed-ranks test with the presence of zero difference scores. 


Example 6.2 A physician states that the median number of times he sees each of his patients 
during the year is five. In order to evaluate the validity of this statement he randomly selects 13 
of his patients and determines the number of office visits each of them made during the past year. 
He obtains the following values for the 13 patients in his sample: 5, 9, 10, 8, 4, 8, 5, 3, 0, 10, 15,
9, 5. Do the data support his contention that the median number of times he sees a patient is
five?


Examination of the data for Example 6.2 reveals that three of the 13 patients visited the 
doctor five times during the year. Since each of these three scores is equal to the hypothesized 
value of the population median, they will all produce difference scores of zero. In employing the 
ranking protocol for the Wilcoxon signed-ranks test, all three of these scores will be eliminated 
from the data analysis. Upon elimination of the three scores, the following ten scores remain: 
9, 10, 8, 4, 8, 3, 0, 10, 15, 9. Since the ten remaining scores are identical to the ten scores 
employed in Example 6.1, the result for Example 6.2 will be identical to that for Example 6.1. 

If, on the other hand, the single-sample t test is employed to evaluate Example 6.2, all 13
scores are included in the calculations, resulting in the value t = 1.87,⁸ which is less than the value
t = 1.94 obtained for Example 2.1 (which employs the 10 scores used in Example 6.1).⁹ The
point to be made here is that by not employing the zero difference scores, the same T value is
computed for the Wilcoxon signed-ranks test for both Examples 6.1 and 6.2. Yet in the case
of the single-sample t test, which employs all 13 scores for the analysis of Example 6.2, the
computed value of t for the latter example is not the same as the computed value of t for Example
2.1 (which employs the same data as Example 6.1). Thus, the presence of zero difference scores
may serve to increase the likelihood of a discrepancy between the results obtained with the
Wilcoxon signed-ranks test and the single-sample t test.
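
The contrast between the two t values can be verified with the following Python sketch, which relies only on the standard library. The function name `single_sample_t` is hypothetical; it assumes the usual single-sample t statistic, in which the sample standard deviation is computed with n - 1 in the denominator.

```python
import math
import statistics


def single_sample_t(scores, mu):
    """Single-sample t statistic for the null hypothesis that the mean equals mu."""
    n = len(scores)
    s = statistics.stdev(scores)          # unbiased estimate (n - 1 in the denominator)
    standard_error = s / math.sqrt(n)
    return (statistics.mean(scores) - mu) / standard_error


example_6_2 = [5, 9, 10, 8, 4, 8, 5, 3, 0, 10, 15, 9, 5]
hypothesized = 5

# The t test uses all 13 scores, including the three equal to the hypothesized value.
t_13 = single_sample_t(example_6_2, hypothesized)

# Dropping the zero difference scores leaves the ten scores of Example 6.1.
nonzero = [x for x in example_6_2 if x != hypothesized]
t_10 = single_sample_t(nonzero, hypothesized)

print(round(t_13, 2), round(t_10, 2))
```

The Wilcoxon T value is unchanged by the three scores of 5, whereas the t statistic falls from 1.94 to 1.87 when they are included.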


Example 6.3 A college English instructor reads in an educational journal that the median 
number of times a student is absent from a class that meets for fifty minutes three times a week 
during a 15 week semester is 0 = 5. During the fall semester she keeps a record of the number 
of times each of the 10 students in her writing class is absent. She obtains the following values: 
9, 10, 8, 4, 8, 3, 0, 10, 15, 9. Do the data suggest that the class is representative of a population
that has a median of 5? 


Since Example 6.3 employs the same data as Example 6.1, it yields the identical result.


Endnotes


1. Some sources note that one assumption of the Wilcoxon signed-ranks test is that the 
variable being measured is based on a continuous distribution. In practice, however, this 
assumption is often not adhered to. 


2. The binomial sign test for a single sample is employed with data that is in the form of a 
dichotomous variable (i.e., a variable represented by two categories). Each subject's score
is assigned to one of the following two categories: Above the value of the hypothesized popu- 
lation median versus Below the value of the hypothesized population median. The test allows 
a researcher to compute the probability of obtaining the proportion of subjects in each of the 
two categories, as well as more extreme distributions with respect to the two categories. 


3. The Wilcoxon signed-ranks test can also be employed in place of the single-sample z test 
when the value of σ is known, but the normality assumption of the latter test is saliently
violated. 


4. It is just coincidental in this example that the absolute value of some of the difference scores
corresponds to the value of the rank assigned to that difference score. 


5. The reader should take note of the fact that no critical values are recorded in Table A5 for 
very small sample sizes. In the event a sample size is employed for which a critical value is 
not listed at a given level of significance, the null hypothesis cannot be evaluated at that level 
of significance. This is the case since with small sample sizes the distribution of ranks will 
not allow one to generate probabilities equal to or less than the specified alpha value. 

6. The term (Σt³ - Σt) in Equation 6.4 can also be written as $\sum_{i=1}^{s} (t_i^3 - t_i)$. The latter
notation indicates the following: a) For each set of ties, the number of ties in the set is subtracted
from the cube of the number of ties in that set; and b) the sum of all the values computed in
a) is obtained. Thus, in the example under discussion (in which there are s = 3 sets of ties):


$$\sum_{i=1}^{s} (t_i^3 - t_i) = [(2)^3 - 2] + [(2)^3 - 2] + [(3)^3 - 3] = 36$$


The above computed value of 36 is the same as the corresponding value (Σt³ - Σt) = 43
- 7 = 36 computed in Equation 6.4 through use of Table 6.4.


7. A correction for continuity can be used in conjunction with the tie correction by subtracting
.5 from the absolute value computed for the numerator of Equation 6.4. Use of the correction
for continuity will reduce the tie-corrected absolute value of z.


8. If the single-sample t test is employed with the 13 scores listed for Example 6.2, ΣX = 91,
$\bar{X} = 91/13 = 7$, and ΣX² = 815. Thus, $\tilde{s} = \sqrt{[815 - [(91)^2/13]]/(13 - 1)} = 3.85$,
$s_{\bar{X}} = 3.85/\sqrt{13} = 1.07$, and t = (7 - 5)/1.07 = 1.87.


9. Even though the obtained value t = 1.87 is smaller than the value t = 1.94 obtained for
Example 2.1, it is still significant at the .05 level if one employs the directional alternative
hypothesis H₁: μ > 5. This is the case, since for df = 12 the tabled critical one-tailed .05
value is t.05 = 1.78, and t = 1.87 exceeds the latter tabled critical value.


Test 7


The Kolmogorov-Smirnov Goodness-of-Fit 
Test for a Single Sample 
(Nonparametric Test Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Does the distribution of n scores that comprise a sample con- 
form to a specific theoretical or empirical population (or probability) distribution? 


Relevant background information on test The Kolmogorov-Smirnov goodness-of-fit test 
for a single sample was developed by Kolmogorov (1933). Daniel (1980) notes that because 
of the similarity between Kolmogorov's test and a goodness-of-fit test for two independent 
samples developed by Smirnov (1939) (the Kolmogorov-Smirnov test for two independent 
samples (Test 13)), the test to be discussed is generally referred to as the Kolmogorov-Smirnov 
goodness-of-fit test for a single sample. 

The Kolmogorov-Smirnov goodness-of-fit test for a single sample is one of a number of 
goodness-of-fit tests discussed in this book. Goodness-of-fit tests are employed to determine 
whether the distribution of scores in a sample conforms to the distribution of scores in a specific 
theoretical or empirical population (or probability) distribution. Goodness-of-fit tests are some- 
what unique when contrasted with other types of inferential statistical tests, in that when con- 
ducting a goodness-of-fit test a researcher generally wants or expects to retain the null hypothesis. 
In other words, the researcher wants to demonstrate that a sample is derived from a distribution 
of a specific type (e.g., a normal distribution). On the other hand, in employing most other 
inferential tests, a researcher wants or expects to reject the null hypothesis — i.e., the researcher 
wants or expects to demonstrate that one or more samples do not come from a specific population 
or from the same population. It should be noted that the alternative hypothesis for a goodness-of- 
fit test generally does not stipulate an alternative distribution that would become the most likely 
distribution for the data if the null hypothesis is rejected. 

Unlike the chi-square goodness-of-fit test (Test 8), which is discussed in the next chapter, 
the Kolmogorov-Smirnov goodness-of-fit test for a single sample is designed to be employed 
with a continuous variable. (A continuous variable is characterized by the fact that a given score 
can assume any value within the range of values that define the limits of that variable.) The chi- 
square goodness-of-fit test, on the other hand, is designed to be employed with nominal/ 
categorical data involving a discrete variable. (A discrete variable is characterized by the fact 
that there are a limited number of values which any score for the variable can assume.) Further 
clarification of the distinction between discrete and continuous variables can be found in the 
Introduction. 

The Kolmogorov-Smirnov goodness-of-fit test for a single sample is categorized as a test 
of ordinal data because it requires that a cumulative frequency distribution be constructed 
(which requires that scores be arranged in order of magnitude). In the Introduction it was noted 
that in a cumulative frequency distribution, the cumulative frequency for a given score 
represents the frequency of a score plus the frequencies of all scores which are less than that
score. Scores are arranged ordinally, with the lowest score at the bottom of the distribution, and 
the highest score at the top of the distribution. The cumulative frequency for the lowest score will 
simply be the frequency for that score, since there are no scores below it. On the other hand, the 
cumulative frequency for the highest score will always equal n, the total number of scores in the 
distribution. In some instances a cumulative frequency distribution may present cumulative 
proportions (which can also be expressed as probabilities) or cumulative percentages in lieu of 
and/or in addition to cumulative frequencies. A cumulative proportion or percentage for a given 
score represents the proportion or percentage of scores that are equal to or less than that score.
When the term cumulative probability is employed, it means the likelihood of obtaining a given
score or any score below it (which is numerically equivalent to the cumulative proportion for that
score). Table 7.1 represents a cumulative frequency distribution for a distribution comprised of
n = 20 scores. Each of the scores that occurs in the distribution is listed in the first column. Note
that in the third column of Table 7.1, the cumulative frequency values are obtained by adding to
the frequency of a score in a given row the frequencies of all scores that fall below it. A
cumulative proportion for a score is obtained by dividing the cumulative frequency of the score
by n. A cumulative proportion is converted into a cumulative percentage by moving the decimal
point for the proportion two places to the right.


Table 7.1 Cumulative Frequency Distribution

        Frequency   Cumulative   Cumulative      Cumulative
   X       (f)      frequency    proportion      percentage

  15        3          20        20/20 = 1          100%
  14        2          17        17/20 = .85         85%
  13        2          15        15/20 = .75         75%
  12        0          13        13/20 = .65         65%
  11        4          13        13/20 = .65         65%
  10        2           9         9/20 = .45         45%
   9        1           7         7/20 = .35         35%
   8        0           6         6/20 = .30         30%
   7        4           6         6/20 = .30         30%
   6        2           2         2/20 = .10         10%

         n = 20
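
The construction of Table 7.1 can be sketched in a few lines. The variable names are illustrative assumptions; the scores and frequencies are those listed in the table, entered from the lowest score to the highest so that a running total gives the cumulative frequencies.

```python
from itertools import accumulate

# Scores and frequencies from Table 7.1, ordered from lowest to highest score.
scores = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
freqs = [2, 4, 0, 1, 2, 4, 0, 2, 2, 3]

n = sum(freqs)                           # total number of scores
cum_freqs = list(accumulate(freqs))      # running total of the frequencies
cum_props = [cf / n for cf in cum_freqs] # cumulative proportions

# Print the highest score first, matching the layout of Table 7.1.
rows = list(zip(scores, freqs, cum_freqs, cum_props))
for x, f, cf, cp in reversed(rows):
    print(f"{x:>3} {f:>3} {cf:>4} {cp:>6.2f} {cp:>6.0%}")
```

Reversing the scores before accumulating would produce incorrect totals, since each cumulative frequency must count the scores at or below a given value.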


In the example to be presented for the Kolmogorov-Smirnov goodness-of-fit test for a 
single sample, a cumulative frequency distribution will be constructed. However, the table con- 
taining the cumulative frequency distribution of the test data (Table 7.2) will list the scores in 
reverse order from that listed in Table 7.1 (i.e., in Table 7.2 the lowest score will be at the top 
and the highest score at the bottom). This alternative way of arranging the cumulative frequencies 
is commonly employed to summarize the data analysis for the Kolmogorov-Smirnov goodness-of-fit
test for a single sample.


II. Example 


Example 7.1 A researcher conducts a study to evaluate whether the distribution of the length
of time it takes migraine patients to respond to a 100 mg dose of an intravenously administered
drug is normal, with a mean response time of 90 seconds and a standard deviation of 35 seconds
(i.e., $\mu = 90$ and $\sigma = 35$). The amount of time (in seconds) that elapses between the administration
of the drug and cessation of a headache for 30 migraine patients is recorded below. The 30
scores are arranged ordinally (i.e., from fastest response time to slowest response time).

21, 32, 38, 40, 48, 55, 63, 66, 70, 75, 80, 84, 86, 90, 90, 93, 95, 98, 100, 105, 106, 108, 115, 118,
126, 128, 130, 142, 145, 155


Do the data conform to a normal distribution with the specified parameters?


III. Null versus Alternative Hypotheses


Prior to reading the null and alternative hypotheses to be presented in this section, the reader
should take note of the following: a) The protocol for the Kolmogorov-Smirnov goodness-of-fit
test for a single sample requires that a cumulative probability distribution be constructed
for both the sample distribution and the hypothesized population distribution. The test statistic
is defined by the greatest vertical distance at any point between the two
cumulative probability distributions; and b) Within the framework of the null and alternative
hypotheses, the notation $F(X)$ represents the population distribution from which the sample
distribution is derived, while the notation $F_0(X)$ represents the hypothesized theoretical or empirical
distribution with respect to which the sample distribution is being evaluated for goodness-of-fit.
Alternatively, $F(X)$ can be conceptualized as representing the cumulative probability distribution
for the population from which the sample is derived, and $F_0(X)$ as the cumulative probability
distribution for the hypothesized population.


Null hypothesis          $H_0$: $F(X) = F_0(X)$ for all values of X


(The distribution of data in the sample is consistent with the hypothesized theoretical population 
distribution. In terms of the parameters stipulated in Example 7.1, the null hypothesis is stating 
that the sample data are derived from a normal distribution, with $\mu = 90$ and $\sigma = 35$. Another
way of stating the null hypothesis is as follows: At no point is the greatest vertical distance 
between the sample cumulative probability distribution (which is assumed to be the best estimate 
of the cumulative probability distribution of the population from which the sample is derived) 
and the hypothesized cumulative probability distribution larger than what would be expected by 
chance, if the sample is derived from the hypothesized distribution.) 


Alternative hypothesis   $H_1$: $F(X) \neq F_0(X)$ for at least one value of X


(The distribution of data in the sample is inconsistent with the hypothesized theoretical 
population distribution. In terms of the parameters stipulated in Example 7.1, the alternative hypothesis
is stating that the sample data are not derived from a normal distribution, with $\mu = 90$ and $\sigma = 35$.
An alternative way of stating this alternative hypothesis is as follows: There is at least one point 
where the greatest vertical distance between the sample cumulative probability distribution 
(which is assumed to be the best estimate of the cumulative probability distribution of the 
population from which the sample is derived) and the hypothesized cumulative probability 
distribution is larger than what would be expected by chance, if the sample is derived from the 
hypothesized distribution. At the point of maximum deviation separating the two distributions, 
the cumulative probability for the sample distribution is either significantly greater or less than 
the cumulative probability for the hypothesized distribution. This is a nondirectional 
alternative hypothesis and it is evaluated with a two-tailed test.) 


or 
                         $H_1$: $F(X) > F_0(X)$ for at least one value of X


(The distribution of data in the sample is inconsistent with the hypothesized theoretical population 
distribution. In terms of the parameters stipulated in Example 7.1, the alternative hypothesis is stating
that the sample data are not derived from a normal distribution, with $\mu = 90$ and $\sigma = 35$. The
latter is the case, since there is at least one point at which the vertical distance between the 
sample cumulative probability distribution (which is assumed to be the best estimate of the 
cumulative probability distribution of the population from which the sample is derived) and the 
hypothesized cumulative probability distribution is larger than what would be expected by 
chance, if the sample is derived from the hypothesized distribution. At the point of maximum 
deviation separating the two distributions, the cumulative probability for the sample distribution 
is significantly greater than the cumulative probability for the hypothesized distribution. This 
is a directional alternative hypothesis and it is evaluated with a one-tailed test.) 


or 
                         $H_1$: $F(X) < F_0(X)$ for at least one value of X


(The distribution of data in the sample is inconsistent with the hypothesized theoretical 
population distribution. In terms of the parameters stipulated in Example 7.1, the alternative hypothesis
is stating that the sample data are not derived from a normal distribution, with $\mu = 90$ and $\sigma = 35$.
The latter is the case, since there is at least one point at which the vertical distance between the 
sample cumulative probability distribution (which is assumed to be the best estimate of the 
cumulative probability distribution of the population from which the sample is derived) and the 
hypothesized cumulative probability distribution is larger than what would be expected by 
chance, if the sample is derived from the hypothesized distribution. At the point of maximum 
deviation separating the two distributions, the cumulative probability for the sample distribution 
is significantly less than the cumulative probability for the hypothesized distribution. This is a 
directional alternative hypothesis and it is evaluated with a one-tailed test.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 


As noted in Sections I and III, the test protocol for the Kolmogorov-Smirnov goodness-of-fit 
test for a single sample requires that the cumulative probability distribution for the sample data 
be contrasted with the cumulative probability distribution for the hypothesized population. Table 
7.2 summarizes the steps that are involved in conducting the analysis. 

The values represented in the columns of Table 7.2 are summarized below. 

The values of the response time scores of the 30 subjects in the sample (i.e., the X scores) 
are recorded in Column A. There are 29 rows corresponding to each of the scores (with two 
subjects having obtained the identical score of 90). 

Each value in Column B is the z score (i.e., a standard deviation score) that results when 
the X score in a given row is substituted in the equation $z = (X - \mu)/\sigma$ (which is Equation I.27),
where $\mu = 90$ and $\sigma = 35$. Thus, for each row, the equation z = (X - 90)/35 is employed. To
illustrate, in the case of Row 1, where X = 21, the value z = (21 - 90)/35 = -1.97 is computed.
In the case of the last row, where X = 155, the value z = (155 - 90)/35 = 1.86 is computed. Note
that a negative z value will be obtained for any X score below the mean, and a positive z value 
for any X score above the mean. 
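
The Column B computation can be sketched as follows. The names `times`, `MU`, `SIGMA`, and `z_score` are illustrative assumptions; the data and parameters are those of Example 7.1. Values are rounded to two decimal places, as in a table of the normal distribution, so an individual entry may differ from a hand-computed value in the last digit.

```python
# Response times (in seconds) of the 30 patients in Example 7.1.
times = [21, 32, 38, 40, 48, 55, 63, 66, 70, 75, 80, 84, 86, 90, 90,
         93, 95, 98, 100, 105, 106, 108, 115, 118, 126, 128, 130, 142, 145, 155]

MU, SIGMA = 90, 35  # hypothesized population mean and standard deviation


def z_score(x, mu=MU, sigma=SIGMA):
    """Return the standard deviation score z = (X - mu) / sigma."""
    return (x - mu) / sigma


# One z value per distinct score (the two scores of 90 share a single row).
z_by_score = {x: round(z_score(x), 2) for x in sorted(set(times))}

print(z_by_score[21], z_by_score[90], z_by_score[155])
```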

Each value in Column C represents the proportion of cases in the normal distribution that
falls between the population mean and the z score computed for the X score in a given row (i.e.,
the z value computed in Column B). To illustrate, in the case of Row 1, where X = 21 and
z = -1.97, the proportion .4756 in Column C (which is the entry for z = -1.97 in Column 2 of
Table A1 (Table of the Normal Distribution) in the Appendix) is the proportion of cases in the
normal distribution between the mean and a z score of -1.97. In the case of the last row, where
X = 155 and z = 1.86, the proportion .4686 in Column C is the proportion of cases in the normal
distribution between the mean and a z score of 1.86.


Table 7.2 Calculation of Test Statistic for Kolmogorov-Smirnov Goodness-of-Fit
Test for a Single Sample

   A       B       C       D            E              F                  G
  (X)     (z)     (p)   F0(X_i)       S(X_i)      |S(X_i) -      |S(X_{i-1}) - F0(X_i)|
                       = .50 ± p                   F0(X_i)|

  21   -1.97   .4756   .0244     1/30 = .0333   .0089       |0 - .0244| = .0244
  32   -1.66   .4515   .0485     2/30 = .0667   .0182       |.0333 - .0485| = .0152
  38   -1.49   .4319   .0681     3/30 = .1000   .0319       |.0667 - .0681| = .0014
  40   -1.43   .4236   .0764     4/30 = .1333   .0569 = M   |.1000 - .0764| = .0236
  48   -1.20   .3849   .1151     5/30 = .1667   .0516       |.1333 - .1151| = .0182
  55   -1.00   .3413   .1587     6/30 = .2000   .0413       |.1667 - .1587| = .0080
  63    -.77   .2794   .2206     7/30 = .2333   .0127       |.2000 - .2206| = .0206
  66    -.69   .2549   .2451     8/30 = .2667   .0216       |.2333 - .2451| = .0118
  70    -.57   .2157   .2843     9/30 = .3000   .0157       |.2667 - .2843| = .0176
  75    -.43   .1664   .3336    10/30 = .3333   .0003       |.3000 - .3336| = .0336
  80    -.29   .1141   .3859    11/30 = .3667   .0192       |.3333 - .3859| = .0526
  84    -.17   .0675   .4325    12/30 = .4000   .0325       |.3667 - .4325| = .0658
  86    -.11   .0438   .4562    13/30 = .4333   .0229       |.4000 - .4562| = .0562
  90     .00   .0000   .5000    15/30 = .5000   .0000       |.4333 - .5000| = .0667 = M'
  93     .09   .0359   .5359    16/30 = .5333   .0026       |.5000 - .5359| = .0359
  95     .14   .0557   .5557    17/30 = .5667   .0110       |.5333 - .5557| = .0224
  98     .23   .0901   .5901    18/30 = .6000   .0099       |.5667 - .5901| = .0234
 100     .29   .1141   .6141    19/30 = .6333   .0192       |.6000 - .6141| = .0141
 105     .43   .1664   .6664    20/30 = .6667   .0003       |.6333 - .6664| = .0331
 106     .46   .1772   .6772    21/30 = .7000   .0228       |.6667 - .6772| = .0105
 108     .51   .1950   .6950    22/30 = .7333   .0383       |.7000 - .6950| = .0050
 115     .71   .2611   .7611    23/30 = .7667   .0056       |.7333 - .7611| = .0278
 118     .80   .2881   .7881    24/30 = .8000   .0119       |.7667 - .7881| = .0214
 126    1.03   .3485   .8485    25/30 = .8333   .0152       |.8000 - .8485| = .0485
 128    1.09   .3621   .8621    26/30 = .8667   .0046       |.8333 - .8621| = .0288
 130    1.14   .3729   .8729    27/30 = .9000   .0271       |.8667 - .8729| = .0062
 142    1.48   .4306   .9306    28/30 = .9333   .0027       |.9000 - .9306| = .0306
 145    1.57   .4418   .9418    29/30 = .9667   .0249       |.9333 - .9418| = .0085
 155    1.86   .4686   .9686    30/30 = 1.0000  .0314       |.9667 - .9686| = .0019

Each value in Column D represents the cumulative proportion for a given X score (and its
associated z score) in the hypothesized theoretical distribution (i.e., in a normal distribution with
$\mu = 90$ and $\sigma = 35$). To put it another way, if the decimal point is moved two places to the right,
the value in Column D represents the percentile rank of a given X score in the hypothesized
theoretical distribution. For any X score for which a negative z score is computed, the proportion
in Column D can be obtained by subtracting from .5000 the proportion in Column C for that
score (it will also correspond to the proportion for that z score in Column 3 of Table A1). For
any X score for which a positive z score is computed, the proportion in Column D can be
obtained by adding .5000 to the proportion in Column C for that score (i.e., it will correspond
to the sum of .5000 and the proportion for that z score in Column 2 of Table A1). To illustrate,
in the case of Row 1, where X = 21 and z = -1.97, the proportion .0244 is equal to .5000 - .4756
= .0244. In the case of the last row, where X = 155 and z = 1.86, the proportion .9686 is equal
to .5000 + .4686 = .9686. The values in Column D are commonly represented by the notation
$F_0(X_i)$, where the subscript i represents the i-th score/row in Table 7.2.
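
In place of a printed table of the normal distribution, the Column C and Column D proportions can be obtained from the standard normal cumulative distribution function. The sketch below assumes Python's standard-library `statistics.NormalDist`; the helper names `column_c` and `column_d` are hypothetical labels chosen to mirror the columns of Table 7.2.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution (mean 0, sd 1)


def column_c(z):
    """Proportion of cases between the mean and z (Column C)."""
    return abs(std_normal.cdf(z) - 0.5)


def column_d(z):
    """Cumulative proportion at z (Column D): .5 - C if z is negative, else .5 + C."""
    c = column_c(z)
    return 0.5 - c if z < 0 else 0.5 + c


for z in (-1.97, 1.86):
    print(z, round(column_c(z), 4), round(column_d(z), 4))
```

Since Column D is simply the cumulative distribution function evaluated at z, `column_d(z)` and `std_normal.cdf(z)` return the same value; the two-step form is kept only to parallel the table-based procedure.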

Each value in Column E represents the cumulative proportion for a given X score (and its
associated z score) in the sample distribution. To illustrate, in the case of Row 1, where X = 21,
its cumulative proportion is its cumulative frequency (1) divided by the total number of scores
in the sample (n = 30). Thus, 1/30 = .0333. In the case of the score X = 100, its cumulative
frequency in the sample distribution is 19 (i.e., a score of 100 is equal to or greater than 19 of
the 30 scores). Thus, its cumulative proportion is 19/30 = .6333. In the case of the score X = 155
in the last row, its cumulative frequency in the sample distribution is 30 (since it is the highest
score). Thus, its cumulative proportion is 30/30 = 1. The values in Column E are commonly
represented by the notation $S(X_i)$, where the subscript i represents the i-th score/row in Table 7.2.
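
The Column E values can be sketched as an empirical cumulative proportion for each distinct score. The names `times`, `counts`, and `s_x` are illustrative assumptions; tied scores (the two scores of 90) are counted together, which is why the row for X = 90 jumps from 13/30 to 15/30.

```python
from collections import Counter
from itertools import accumulate

# Response times (in seconds) of the 30 patients in Example 7.1.
times = [21, 32, 38, 40, 48, 55, 63, 66, 70, 75, 80, 84, 86, 90, 90,
         93, 95, 98, 100, 105, 106, 108, 115, 118, 126, 128, 130, 142, 145, 155]

n = len(times)
counts = Counter(times)       # frequency of each distinct score
distinct = sorted(counts)     # distinct scores, lowest first
cum_freq = list(accumulate(counts[x] for x in distinct))

# S(X_i): cumulative frequency divided by n, keyed by score.
s_x = {x: cf / n for x, cf in zip(distinct, cum_freq)}

print(round(s_x[21], 4), round(s_x[90], 4), round(s_x[100], 4), s_x[155])
```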

Each value in Column F is the absolute value of the difference between the proportions in
Column E and Column D; in other words, the difference between the proportions in the
sample distribution and the hypothesized population distribution. Thus, $F_i = |E_i - D_i|$ or
$F_i = |S(X_i) - F_0(X_i)|$. To illustrate, in the case of Row 1, where $D_i = F_0(X_i) = .0244$ and
$E_i = S(X_i) = .0333$, we compute the value $F_i = .0089$ as follows:

$$F_i = |E_i - D_i| = |S(X_i) - F_0(X_i)| = |.0333 - .0244| = .0089$$

In the case of the row where X = 100, and where $D_i = F_0(X_i) = .6141$ and $E_i = S(X_i)
= .6333$, we compute the value $F_i = .0192$ as follows:

$$F_i = |E_i - D_i| = |S(X_i) - F_0(X_i)| = |.6333 - .6141| = .0192$$

In the case of the last row where X = 155, and where $D_i = F_0(X_i) = .9686$ and $E_i = S(X_i)
= 1$, we compute the value $F_i = .0314$ as follows:

$$F_i = |E_i - D_i| = |S(X_i) - F_0(X_i)| = |1 - .9686| = .0314$$


As noted in Section III, the test statistic for the Kolmogorov-Smirnov goodness-of-fit test 
for a single sample is defined by the greatest vertical distance at any point between the two 
cumulative probability distributions. The largest absolute value obtained in Column F will 
represent that value. In Table 7.2 the largest absolute value is .0569, which is designated as the 
test statistic (represented by the notation M). 

Each value in Column G is the absolute value of the difference between the proportion in
Column D for a given row (i.e., $F_0(X_i)$) and the proportion in Column E for the preceding row
(i.e., $S(X_{i-1})$). In other words, $G_i = |E_{i-1} - D_i|$ or $G_i = |S(X_{i-1}) - F_0(X_i)|$. To illustrate, in
the case of Row 1, where $D_i = F_0(X_i) = .0244$ and $E_{i-1} = S(X_{i-1}) = 0$, we compute the value
$G_i = .0244$ as noted below. Note that the value 0 is employed to represent the initial value of
$E_{i-1} = S(X_{i-1})$, since that is the value that .0333 is added to in order to get the entry .0333 in
Row 1 of Column E.

$$G_i = |E_{i-1} - D_i| = |S(X_{i-1}) - F_0(X_i)| = |0 - .0244| = .0244$$

In the case of the row where X = 100, and where $D_i = F_0(X_i) = .6141$ and $E_{i-1} = S(X_{i-1})
= .6000$, we compute the value $G_i = .0141$ as follows:

$$G_i = |E_{i-1} - D_i| = |S(X_{i-1}) - F_0(X_i)| = |.6000 - .6141| = .0141$$

In the case of the last row where X = 155, and where $D_i = F_0(X_i) = .9686$ and $E_{i-1} =
S(X_{i-1}) = .9667$, we compute the value $G_i = .0019$ as follows:

$$G_i = |E_{i-1} - D_i| = |S(X_{i-1}) - F_0(X_i)| = |.9667 - .9686| = .0019$$


As noted above, the test statistic for the Kolmogorov-Smirnov goodness-of-fit test for
a single sample is defined by the greatest vertical distance at any point between the two
cumulative probability distributions. However, when that value is determined mathematically
through use of the values in Column F, it is still possible that the largest vertical distance may
occur at some point between two of the scores in the sample distribution. Since it is assumed
that the variable being evaluated is continuous, if there is a larger vertical distance for some score
other than those in the sample, that larger distance should represent the test statistic, instead of the
M value recorded in Column F. The method for determining whether there is a larger vertical
distance than the maximum value recorded in Column F is to compute the values in Column G
of Table 7.2. If the largest value computed in Column G (designated M') is larger than the M
value computed in Column F, then M' is employed to represent the test statistic. In Table 7.2,
the largest value in Column G is .0667, and since it exceeds M = .0569, M' = .0667 becomes our
test statistic.¹
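
The full computation of M, M', and the test statistic can be sketched as follows. This is a minimal illustration, assuming Python's standard-library `statistics.NormalDist` for the hypothesized distribution; all variable names are hypothetical. Because the sketch evaluates the normal distribution at unrounded z values, individual Column F and G entries (and M itself) may differ from Table 7.2 in the third or fourth decimal place.

```python
from collections import Counter
from itertools import accumulate
from statistics import NormalDist

# Response times (in seconds) of the 30 patients in Example 7.1.
times = [21, 32, 38, 40, 48, 55, 63, 66, 70, 75, 80, 84, 86, 90, 90,
         93, 95, 98, 100, 105, 106, 108, 115, 118, 126, 128, 130, 142, 145, 155]

hypothesized = NormalDist(mu=90, sigma=35)  # distribution stated in the null hypothesis

n = len(times)
counts = Counter(times)
xs = sorted(counts)                                          # Column A (distinct scores)
S = [cf / n for cf in accumulate(counts[x] for x in xs)]     # Column E
F0 = [hypothesized.cdf(x) for x in xs]                       # Column D
S_prev = [0.0] + S[:-1]                                      # S(X_{i-1}); 0 for the first row

col_f = [abs(s - f) for s, f in zip(S, F0)]                  # Column F
col_g = [abs(sp - f) for sp, f in zip(S_prev, F0)]           # Column G

M = max(col_f)
M_prime = max(col_g)
statistic = max(M, M_prime)

print(xs[col_f.index(M)], round(M, 4))
print(xs[col_g.index(M_prime)], round(M_prime, 4))
```

Checking both columns matters because the sample cumulative distribution is a step function: the largest gap can occur just before a step (Column G) rather than at the top of one (Column F).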

An alternative method for determining the largest vertical distance is to draw a graph which 
depicts: a) The curve of the hypothesized cumulative probability distribution, and b) The points 
that represent the cumulative probabilities for the sample distribution (which if connected would 
result in a curve of the cumulative probabilities for the sample distribution). Through use of such 
a graph a determination can be made with respect to whether there is a larger vertical distance 
at some point on the two cumulative probability distributions than the distance/value computed 
for M in Column F. The graphical method is described in Conover (1980, 1999), Daniel (1990), 
and Sprent (1993). If such a graph is constructed for Example 7.1, most of the points for the 
cumulative sample distribution fall above the curve of the hypothesized cumulative probability 
distribution (with the value M = .0569 representing the vertical distance of one of those points 
above the latter curve). However, some of the points, including the one resulting in the value 
M' = .0667, fall below the curve of the hypothesized cumulative probability distribution. 


V. Interpretation of the Test Results 


The test statistic for the Kolmogorov-Smirnov goodness-of-fit test for a single sample is the
larger of the two values M or M'. The test statistic is evaluated with Table A21 (Table of
Critical Values for the Kolmogorov-Smirnov Goodness-of-Fit Test for a Single Sample) in
the Appendix. If at any point along the two cumulative probability distributions the greatest
distance (i.e., the larger of the two values M or M') is equal to or greater than the tabled critical
value recorded in Table A21, the null hypothesis is rejected. The critical values in Table A21
are listed in reference to sample size. For n = 30, the tabled critical two-tailed .05 and .01 values
are $M_{.05} = .242$ and $M_{.01} = .290$, and the tabled critical one-tailed .05 and .01 values are
$M_{.05} = .218$ and $M_{.01} = .270$.

The following guidelines are employed in evaluating the null hypothesis for the 
Kolmogorov-Smirnov goodness-of-fit test for a single sample. 

a) If the nondirectional alternative hypothesis $H_1$: $F(X) \neq F_0(X)$ is employed, the null
hypothesis can be rejected if the computed value of the test statistic is equal to or greater than the 
tabled critical two-tailed M value at the prespecified level of significance. 

b) If the directional alternative hypothesis $H_1$: $F(X) > F_0(X)$ is employed, the null
hypothesis can be rejected if the computed value of the test statistic is equal to or greater than the
tabled critical one-tailed M value at the prespecified level of significance. Additionally, the
difference between the two cumulative probability distributions must be such that, in reference
to the point that represents the test statistic, the cumulative probability associated with the sample
distribution is larger than the cumulative probability associated with the hypothesized population
distribution. In other words, if, instead of computing an absolute value in Columns F and G of
Table 7.2, we retain the sign of the difference, then a positive sign is required for the directional
alternative hypothesis $H_1$: $F(X) > F_0(X)$ to be supported. Thus, if M is the largest vertical
distance, $S(X_i) > F_0(X_i)$, and if M' is the largest vertical distance, $S(X_{i-1}) > F_0(X_i)$.

c) If the directional alternative hypothesis $H_1$: $F(X) < F_0(X)$ is employed, the null
hypothesis can be rejected if the larger of the two values M versus M' is equal to or greater than
the tabled critical one-tailed M value at the prespecified level of significance. Additionally, the
difference between the two cumulative probability distributions must be such that in reference
to the point that represents the test statistic, the cumulative probability associated with the sample
distribution is smaller than the cumulative probability associated with the hypothesized
population distribution. In other words, if, instead of computing an absolute value in Columns
F and G of Table 7.2, we retain the sign of the difference, then a negative sign is required for the
directional alternative hypothesis $H_1$: $F(X) < F_0(X)$ to be supported. Thus, if M is the largest
vertical distance, $S(X_i) < F_0(X_i)$, and if M' is the largest vertical distance, $S(X_{i-1}) < F_0(X_i)$.

The above guidelines will now be employed in reference to the computed test statistic 
M' = .0667. 

a) If the nondirectional alternative hypothesis $H_1$: $F(X) \neq F_0(X)$ is employed, the null
hypothesis cannot be rejected, since M' = .0667 is less than the tabled critical two-tailed values
$M_{.05} = .242$ and $M_{.01} = .290$.

b) If the directional alternative hypothesis $H_1$: $F(X) < F_0(X)$ is employed, the null
hypothesis cannot be rejected, since M' = .0667 is less than the tabled critical one-tailed
values $M_{.05} = .218$ and $M_{.01} = .270$. This is the case in spite of the fact that the test statistic
is consistent with the latter alternative hypothesis (i.e., since [$F_0(X_i) = .5000$] > [$S(X_{i-1})
= .4333$], if the sign is taken into account, the computed value of M' is a negative value:
$M' = S(X_{i-1}) - F_0(X_i) = .4333 - .5000 = -.0667$).

c) If the directional alternative hypothesis $H_1$: $F(X) > F_0(X)$ is employed, the null
hypothesis cannot be rejected, since, for the latter alternative hypothesis to be supported, the
cumulative proportion for the sample distribution must be larger than the cumulative proportion
for the hypothesized population distribution at the point that yields the test statistic. At that
point the sample proportion (.4333) is smaller than the hypothesized proportion (.5000).

A summary of the analysis of Example 7.1 with the Kolmogorov-Smirnov goodness-of-fit 
test for a single sample follows: The data are consistent with the null hypothesis that the sample 
is derived from a normally distributed population, with $\mu = 90$ and $\sigma = 35$.
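
The decision guidelines of this section can be summarized in a short sketch. The function `reject_null`, its argument names, and the labels "two-sided", "greater", and "less" are hypothetical conveniences, not notation from Table A21; the critical values are the ones quoted in the text for n = 30 and apply to that sample size only.

```python
# Tabled critical values for n = 30 as quoted in the text (Table A21).
CRITICAL_N30 = {
    ("two-tailed", 0.05): 0.242,
    ("two-tailed", 0.01): 0.290,
    ("one-tailed", 0.05): 0.218,
    ("one-tailed", 0.01): 0.270,
}


def reject_null(statistic, signed_difference, alternative, alpha):
    """Apply the guidelines for the single-sample Kolmogorov-Smirnov test.

    statistic:         larger of M and M' (an absolute value)
    signed_difference: sample proportion minus hypothesized proportion at the
                       point that yields the statistic
    alternative:       "two-sided", "greater" (F(X) > F0(X)), or "less"
    """
    tails = "two-tailed" if alternative == "two-sided" else "one-tailed"
    if statistic < CRITICAL_N30[(tails, alpha)]:
        return False                      # statistic below the critical value
    if alternative == "greater":
        return signed_difference > 0      # sample proportion must be larger
    if alternative == "less":
        return signed_difference < 0      # sample proportion must be smaller
    return True


# Example 7.1: M' = .0667, with a signed difference of .4333 - .5000 = -.0667.
for alt in ("two-sided", "less", "greater"):
    print(alt, reject_null(0.0667, -0.0667, alt, 0.05))
```

For Example 7.1 every call returns False, in agreement with the conclusion that the null hypothesis is retained under all three alternative hypotheses.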


VI. Additional Analytical Procedures for the Kolmogorov-Smirnov 
Goodness-of-Fit Test for a Single Sample 


1. Computing a confidence interval for the Kolmogorov-Smirnov goodness-of-fit test for
a single sample Daniel (1990) describes how to construct a confidence interval for the
cumulative distribution for the sample proportions.² The confidence interval computed for the
Kolmogorov-Smirnov test statistic is comprised of two sets of limits: a set of upper limits
and a set of lower limits. The reference points for determining the latter values are the $S(X_i)$
scores in Column E of Table 7.2. Equation 7.1 is the general equation for computing the limits
that define a confidence interval at any point along the cumulative probability distribution for the
sample.

$$CI_{(1-\alpha)} = S(X_i) \pm M_{\alpha/2} \qquad \text{(Equation 7.1)}$$

Where: $M_{\alpha/2}$ represents the tabled critical two-tailed M value for a given value of n, below
       which a proportion (percentage) equal to $[1 - (\alpha/2)]$ of the cases falls. If the
       proportion (percentage) of the distribution that falls within the confidence interval is
       subtracted from 1 (100%), it will equal the value of $\alpha$.


The upper limits for the confidence interval are computed by adding the relevant critical
value to each of the values of $S(X_i)$ in Column E of Table 7.2. If any of the resulting values
is greater than 1, the upper limit for that $S(X_i)$ value is set equal to 1, since a probability cannot
be greater than 1. In the case of Example 7.1, 29 upper limit values will be computed, each
value corresponding to one of the 29 $S(X_i)$ values recorded in the rows of Table 7.2.

The lower limits for the confidence interval are computed by subtracting the relevant
critical value from each of the values of $S(X_i)$ in Column E of Table 7.2. If any of the resulting
values is less than 0, the lower limit for that $S(X_i)$ value is set equal to 0, since a probability
cannot be less than 0. In the case of Example 7.1, 29 lower limit values will be computed, each
value corresponding to one of the 29 $S(X_i)$ values recorded in the rows of Table 7.2.

The above methodology will now be described in reference to Example 7.1. Let us assume
we wish to compute a 95% confidence interval for the cumulative probability distribution of the
population from which the sample is derived. Since we are interested in the 95% confidence
interval, the value that will be employed for $M_{\alpha/2}$ in Equation 7.1 will be the tabled critical
two-tailed .05 M value, which as previously noted is $M_{.05} = .242$. Thus, we will add to and subtract
.242 from each of the $S(X_i)$ values in Column E of Table 7.2.

To illustrate, the first $S(X_i)$ value (associated with the score of X = 21) is .0333. When
.242 is added to the latter value we obtain .2753, which is the upper limit for that point on the
cumulative probability distribution. When .242 is subtracted from .0333 we obtain the value
-.2087. Since the latter value is less than zero, we set the lower limit at that point equal to zero.

In the case of the score of X = 90, the value of $S(X_i)$ is .5000. When .242 is added to the
latter value we obtain .7420, which is the upper limit for that point on the cumulative probability
distribution. When .242 is subtracted from .5000 we obtain .2580, which is the lower limit for
that point on the cumulative probability distribution.

In the case of the score of X = 155, the value of $S(X_i)$ is 1. When .242 is added to the latter
value we obtain 1.242. Since the latter value is greater than 1, we set the upper limit at that point
equal to 1. When .242 is subtracted from 1 we obtain .7580, which is the lower limit for that
point on the cumulative probability distribution.

As noted earlier, the above described procedure is employed for all 29 points on the 
cumulative probability distribution for the sample. The resulting set of upper and lower limits 
defines the confidence interval. 
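
The limits described above can be sketched as a function that adds and subtracts the critical value and then restricts each limit to the range 0 to 1. The function name `confidence_band` is a hypothetical label; the three sample proportions and the critical value .242 are those used in the illustrations above.

```python
def confidence_band(s_values, critical_m):
    """Return (lower, upper) limits S(X_i) -/+ M for each value, clipped to [0, 1]."""
    return [(max(0.0, s - critical_m), min(1.0, s + critical_m))
            for s in s_values]


# S(X_i) for the scores X = 21, X = 90, and X = 155 in Example 7.1.
band = confidence_band([1 / 30, 15 / 30, 30 / 30], 0.242)

for lower, upper in band:
    print(round(lower, 4), round(upper, 4))
```

In a full analysis the same function would be applied to all 29 values in Column E of Table 7.2.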


2. The power of the Kolmogorov-Smirnov goodness-of-fit test for a single sample Books 
thatdiscuss the Kolmogorov-Smirnov goodness-of-fit test for a single sample do not describe 
specific procedures for computing the power of the test. Conover (1980, 1999), Daniel (1990), 
and Hollander and Wolfe (1999) cite sources that discuss the power of the test and/or describe 
procedures for determining power. Daniel (1980) and Zar (1999) note that when the 
Kolmogorov-Smirnov goodness-of-fit test for a single sample is employed with grouped data 
(i.e., scores are categorized in class intervals instead of evaluating each score separately), the test 
becomes overly conservative (i.e., the power of the test is reduced). Zar (1999) presents a 
correction factor for the Kolmogorov-Smirnov test statistic, endorsed by Harter et al. (1984) 
and Khamis (1990), which can increase the power of the test. Zar (1999) also states that the 
Kolmogorov-Smirnov test is more powerful than the chi-square goodness-of-fit test under the 
following conditions: a) When the sample size is small; and b) When the expected frequencies 
for the chi-square test are small. 


3. Test 7a: The Lilliefors test for normality Massey (1951) notes that when the population 
parameters (e.g., μ and σ) are not known beforehand, but are instead estimated from the sample 
data, the result yielded by the Kolmogorov-Smirnov goodness-of-fit test for a single sample 
tends to be overly conservative (i.e., the statistical power of the test is less than its power when 
the values of the parameters are known). Various sources (Conover (1980, 1999) and Daniel 
(1990)) describe Lilliefors' (1967, 1969, 1973) extension of the Kolmogorov-Smirnov 
goodness-of-fit test for a single sample to circumstances in which the values of the population 
parameters for a variety of distributions (e.g., normal, exponential, gamma) are not known, and 
thus have to be estimated from the sample data.³ The procedure to be described here, which is 
designed to assess goodness-of-fit for a normal distribution when one or both of the population 
parameters μ and σ are unknown, is referred to as the Lilliefors test for normality. The test 
procedure for the Lilliefors test for normality is identical to that described for the 
Kolmogorov-Smirnov goodness-of-fit test for a single sample, except for the following: a) 
The values of the sample mean (X̄) and estimated population standard deviation (s̃) are 
employed to represent the mean and standard deviation of the hypothesized population 
distribution. Thus, the values of X̄ and s̃ are employed to compute the z values in Column B 
of Table 7.2; and b) Instead of obtaining the critical values from Table A21, the values 
documented in Table A22 (Table of Critical Values for the Lilliefors Test for Normality) in 
the Appendix are employed.⁴ As is the case with employing the critical values in Table A21, 
in order to reject the null hypothesis when the test statistic is based on the Lilliefors test for 
normality, the computed value for M or M' (i.e., whichever of the two is larger) must be equal 
to or greater than the tabled critical value in Table A22 at the prespecified level of significance. 
The values recorded in Table A22 are only applicable when both the values of μ and σ are 
unknown, and must be estimated from the sample data. 

Table 7.3 reevaluates the data for Example 7.1 employing the values for the sample 
mean (X̄ = 90.07) and estimated population standard deviation (s̃ = 34.79) in place of the 
values μ = 90 and σ = 35 employed for the Kolmogorov-Smirnov goodness-of-fit test for a 
single sample. The values X̄ = 90.07 and s̃ = 34.79 were computed by employing Equations 
I.1 and I.8 with the 30 scores in Example 7.1. 

In Table 7.3 the computed values for M and M' in Columns F and G are M = .0594 and 
M' = .0667. Since the values X̄ = 90.07 and s̃ = 34.79 are quite close to the values μ = 90 
and σ = 35 employed for the Kolmogorov-Smirnov goodness-of-fit test for a single sample, 
it is not surprising that the values in the rows of Table 7.3 are quite close and, in some cases, 
identical to the values in the rows of Table 7.2. The values M = .0594 and M' = .0667 obtained 
in Table 7.3 are either very close or identical to the values M = .0569 and M' = .0667 obtained 
in Table 7.2 for the Kolmogorov-Smirnov goodness-of-fit test for a single sample. Since 
M' = .0667 is larger than M = .0594, M' = .0667 will represent the test statistic for the Lilliefors 
test for normality. 
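
The computation of the Lilliefors test statistic can be sketched in Python as follows. This is 
a minimal sketch under stated assumptions, not a definitive implementation: the function names 
are hypothetical, the cumulative normal proportion is obtained from the error function rather 
than from a printed table, and z is rounded to two decimal places to mimic the use of such a 
table. Individual entries may therefore differ slightly from hand-computed values. Tied scores 
are handled with the protocol described in this chapter (one row per distinct score). 

```python
import math
import statistics

# The 30 scores of Example 7.1
scores = [21, 32, 38, 40, 48, 55, 63, 66, 70, 75, 80, 84, 86, 90, 90, 93,
          95, 98, 100, 105, 106, 108, 115, 118, 126, 128, 130, 142, 145, 155]


def normal_cdf(z):
    """Cumulative proportion of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lilliefors_statistics(data):
    """Return (mean, sd, M, M') with the parameters estimated from the sample."""
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # estimated population SD (n - 1 in denominator)
    m = m_prime = 0.0
    s_prev = 0.0                 # S(X_(i-1)); zero before the first score
    for x in sorted(set(data)):  # one row per distinct score
        z = round((x - mean) / sd, 2)            # assumption: two-decimal z
        f = normal_cdf(z)                        # F(X_i)
        s = sum(1 for v in data if v <= x) / n   # S(X_i)
        m = max(m, abs(s - f))                   # Column F
        m_prime = max(m_prime, abs(s_prev - f))  # Column G
        s_prev = s
    return mean, sd, m, m_prime


mean, sd, m, m_prime = lilliefors_statistics(scores)
test_statistic = max(m, m_prime)
# Tabled critical one-tailed .05 value for n = 30 (Table A22), entered by hand
reject_at_05 = test_statistic >= 0.161
print(round(mean, 2), round(sd, 2), round(test_statistic, 4), reject_at_05)
```

The larger of the two values returned for M and M' serves as the test statistic, which is then 
compared with the tabled critical value for the appropriate sample size. 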

As is the case with Table A21, the critical values listed in Table A22 are listed in reference 
to sample size. Lilliefors' (1967) table only contains two-tailed .40, .30, .20, .10, and .02 values, 
and one-tailed .20, .15, .10, .05, and .01 values. Daniel (1990) employs more detailed tables for 
the Lilliefors test statistic developed by Mason and Bell (1986). The latter tables have slightly 
different critical values than those listed in Table A22. Mason and Bell’s (1986) tables also have 
additional critical values for when μ is unknown and σ is known, and for when μ is known and 
σ is unknown. 

Table 7.3 Calculation of Test Statistic for the Lilliefors Test for Normality 


  A      B       C        D              E                    F                       G 
 X_i     z       p      F(X_i)         S(X_i)        |S(X_i) - F(X_i)|     |S(X_(i-1)) - F(X_i)| 
                      (p ± .50) 
  21   -1.99   .4761    .0239     1/30 =  .0333      .0094                 |0 - .0239|     = .0239 
  32   -1.67   .4525    .0475     2/30 =  .0667      .0192                 |.0333 - .0475| = .0142 
  38   -1.50   .4332    .0668     3/30 =  .1000      .0332                 |.0667 - .0668| = .0001 
  40   -1.44   .4251    .0749     4/30 =  .1333      .0594 = M             |.1000 - .0749| = .0251 
  48   -1.21   .3869    .1131     5/30 =  .1667      .0536                 |.1333 - .1131| = .0202 
  55   -1.01   .3438    .1562     6/30 =  .2000      .0438                 |.1667 - .1562| = .0105 
  63    -.78   .2823    .2177     7/30 =  .2333      .0156                 |.2000 - .2177| = .0177 
  66    -.69   .2549    .2451     8/30 =  .2667      .0216                 |.2333 - .2451| = .0118 
  70    -.58   .2190    .2810     9/30 =  .3000      .0190                 |.2667 - .2810| = .0143 
  75    -.43   .1664    .3336    10/30 =  .3333      .0003                 |.3000 - .3336| = .0336 
  80    -.29   .1141    .3859    11/30 =  .3667      .0192                 |.3333 - .3859| = .0526 
  84    -.17   .0675    .4325    12/30 =  .4000      .0325                 |.3667 - .4325| = .0658 
  86    -.12   .0478    .4522    13/30 =  .4333      .0189                 |.4000 - .4522| = .0522 
  90     .00   .0000    .5000    15/30 =  .5000      .0000                 |.4333 - .5000| = .0667 = M' 
  93     .08   .0319    .5319    16/30 =  .5333      .0014                 |.5000 - .5319| = .0319 
  95     .14   .0557    .5557    17/30 =  .5667      .0110                 |.5333 - .5557| = .0224 
  98     .23   .0901    .5901    18/30 =  .6000      .0099                 |.5667 - .5901| = .0234 
 100     .29   .1141    .6141    19/30 =  .6333      .0192                 |.6000 - .6141| = .0141 
 105     .43   .1664    .6664    20/30 =  .6667      .0003                 |.6333 - .6664| = .0331 
 106     .46   .1772    .6772    21/30 =  .7000      .0228                 |.6667 - .6772| = .0105 
 108     .52   .1985    .6985    22/30 =  .7333      .0348                 |.7000 - .6985| = .0015 
 115     .72   .2642    .7642    23/30 =  .7667      .0025                 |.7333 - .7642| = .0309 
 118     .80   .2881    .7881    24/30 =  .8000      .0119                 |.7667 - .7881| = .0214 
 126    1.03   .3485    .8485    25/30 =  .8333      .0152                 |.8000 - .8485| = .0485 
 128    1.09   .3621    .8621    26/30 =  .8667      .0046                 |.8333 - .8621| = .0288 
 130    1.15   .3749    .8749    27/30 =  .9000      .0251                 |.8667 - .8749| = .0082 
 142    1.49   .4319    .9319    28/30 =  .9333      .0014                 |.9000 - .9319| = .0319 
 145    1.58   .4429    .9429    29/30 =  .9667      .0238                 |.9333 - .9429| = .0096 
 155    1.87   .4693    .9693    30/30 = 1.0000      .0307                 |.9667 - .9693| = .0026 


Employing Table A22 for n = 30, it can be seen that the tabled critical one-tailed .05 and 
.01 values (which correspond to the two-tailed .10 and .02 critical values) are M_.05 = .161 and 
M_.01 = .187. Since M' = .0667 is less than both of the aforementioned critical values, the null 
hypothesis of normality cannot be rejected. From the magnitude of the critical values, it is 
obvious that if more detailed tables were available listing the two-tailed .05 and .01 critical 
values, the latter values would be greater than the value M' = .0667, and thus the result would 
not be significant. Since the test statistic is interpreted in the same way as the 
Kolmogorov-Smirnov test statistic, the conclusion drawn from the Lilliefors test for 
normality is identical to that reached with the Kolmogorov-Smirnov test. Thus, the null 
hypothesis of normality is retained. 


VII. Additional Discussion of the Kolmogorov-Smirnov 
Goodness-of-Fit Test for a Single Sample 


1. Effect of sample size on the result of a goodness-of-fit test Conover (1980, 1999) notes 
that if one employs a large enough sample size, almost any goodness-of-fit test will result in 
rejection of the null hypothesis. In view of the latter, Conover (1980, 1999) states that in order 
to conclude on the basis of a goodness-of-fit test that data conform to a specific distribution, the 
data should be reasonably close to the specifications of the distribution. Thus, in some cases 
where a large sample size is involved, a researcher may end up rejecting the null hypothesis of 
goodness-of-fit for a hypothesized distribution, yet in spite of the latter, if the sample data are 
reasonably close to the hypothesized distribution, one can probably operate on the assumption 
that the sample data provide an adequate fit for the hypothesized distribution. 


2. The Kolmogorov-Smirnov goodness-of-fit test for a single sample versus the chi-square 
goodness-of-fit test and alternative goodness-of-fit tests Daniel (1980) discusses the relative 
merits of employing the Kolmogorov-Smirnov goodness-of-fit test for a single sample for 
assessing goodness-of-fit versus the chi-square goodness-of-fit test. In his discussion Daniel 
(1980) notes the following: a) Whereas the Kolmogorov-Smirnov test is designed for use with 
continuous data, the chi-square goodness-of-fit test is designed to be used with discrete data; 
b) The Kolmogorov-Smirnov test is able to evaluate a one-tailed hypothesis regarding 
goodness-of-fit, while the chi-square test is not suited for such an analysis; c) The 
Kolmogorov-Smirnov test allows for the computation of a confidence interval for the 
cumulative population distribution the sample represents, whereas the chi-square test does not 
allow the latter computation; d) Since the chi-square test groups data into categories/class 
intervals, it does not use as much information as the Kolmogorov-Smirnov test, which 
generally evaluates each score separately; and e) The chi-square test provides an approximation 
of an exact sampling distribution (the multinomial distribution), whereas the sampling 
distribution employed by the Kolmogorov-Smirnov test is exact. 

Although the Kolmogorov-Smirnov test for a single sample and the chi-square 
goodness-of-fit test are the most commonly employed (as well as discussed) tests for 
goodness-of-fit, a number of other goodness-of-fit tests have been developed (including the 
following, which are described in this book: The single sample test for evaluating population 
skewness (Test 4), the single sample test for evaluating population kurtosis (Test 5), and the 
D'Agostino-Pearson test of normality (Test 5a)). Among the other goodness-of-fit tests that 
are described and/or discussed in nonparametric statistics books are David's empty cell test 
(David (1950)), the Cramér-von Mises goodness-of-fit test (attributed to Cramér (1928), von 
Mises (1931), and Smirnov (1936)), and the Shapiro-Wilk test for normality (Shapiro and 
Wilk (1965, 1968)) (which is described in Conover (1980, 1999)). Daniel (1990) contains a 
comprehensive discussion of alternative goodness-of-fit procedures. 

The general subject of goodness-of-fit tests for randomness is discussed in Section VII of 
the single-sample runs test (Test 10). Autocorrelation, which is a procedure that can also be 
employed for assessing goodness-of-fit for randomness, is discussed in Section VII of the 
Pearson product-moment correlation coefficient. 


VIII. Additional Example Illustrating the Kolmogorov-Smirnov 
Goodness-of-Fit Test for a Single Sample 


Example 7.2 The results of an intelligence test administered to 30 students are evaluated with 
respect to goodness-of-fit for a normal distribution with the following parameters: μ = 90 and 
σ = 35. The IQ scores of the 30 students are noted below. 


21, 32, 38, 40, 48, 55, 63, 66, 70, 75, 80, 84, 86, 90, 90, 93, 95, 98, 100, 105, 106, 108, 115, 118, 
126, 128, 130, 142, 145, 155 


Do the data conform to a normal distribution with the specified parameters? 
Since Example 7.2 employs the same data as Example 7.1, it yields the identical result. 


Endnotes 


1. a) Marascuilo and McSweeney (1977) employ a modified protocol that can result in a larger 
absolute value for M in Column F or M' in Column G than the one obtained in Table 7.2. 
The latter protocol employs a separate row in the table for each instance in which the same 
score occurs more than once in the sample data. If the latter protocol were employed in Table 
7.2, there would be two rows in the table for the score of 90 (which is the only score that 
occurs more than once). The first 90 would be recorded in Column A in a row that has a 
cumulative proportion in Column E equal to 14/30 = .4667. The second 90 would be 
recorded in the following row in Column A with a cumulative proportion in Column E equal 
to 15/30 = .5000. In the case of Example 7.1, the outcome of the analysis would not be 
affected if the aforementioned protocol is employed. In some instances, however, it can 
result in a different/larger M or M' value. The protocol employed by Marascuilo and 
McSweeney (1977) is employed by sources who argue that when there are ties present in the 
data (i.e., a score occurs more than once), the protocol described in this chapter (which is 
used in most sources) results in an overly conservative test (i.e., makes it more difficult to 
reject a false null hypothesis); b) It is not necessary to compute the values in Column G if 
a discrete variable is being evaluated. Conover (1980, 1999) and Daniel (1990) discuss the 
use of the Kolmogorov-Smirnov goodness-of-fit test for a single sample with discrete 
data. Studies cited in the latter sources indicate that when the Kolmogorov-Smirnov test 
is employed with discrete data, it yields an overly conservative result (i.e., the power of the 
test is reduced). 


2. A general discussion of confidence intervals can be found in Section VI of the single sample 
t test (Test 2). 


3. The gamma and exponential distributions are continuous probability distributions. 


4. Table A22 is only appropriate for assessing goodness-of-fit for a normal distribution. 
Lilliefors (1969, 1973) has developed tables for other distributions (e.g., the exponential and 
gamma distributions). 


Test 8 
The Chi-Square Goodness-of-Fit Test 


(Nonparametric Test Employed with Categorical/Nominal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test In the underlying population represented by a sample are the 
observed cell frequencies different from the expected cell frequencies? 


Relevant background information on test The chi-square goodness-of-fit test, also referred 
to as the chi-square test for a single sample, is employed in a hypothesis testing situation 
involving a single sample. Based on some preexisting characteristic or measure of performance, 
each of n observations (subjects/objects) that is randomly selected from a population consisting 
of N observations (subjects/objects) is assigned to one of k mutually exclusive categories.¹ The 
data are summarized in the form of a table consisting of k cells, each cell representing one of the 
k categories. Table 8.1 summarizes the general model for the chi-square goodness-of-fit test. 
In Table 8.1, C_i represents the i-th cell/category and O_i represents the number of observations 
in the i-th cell. The number of observations recorded in each cell of the table is referred to as the 
observed frequency of a cell. 


Table 8.1 General Model for Chi-Square Goodness-of-Fit Test 


                                   Cell/Category                  Total number of 
                         C_1   C_2   ...   C_i   ...   C_k          observations 

Observed frequency       O_1   O_2   ...   O_i   ...   O_k               n 


The experimental hypothesis evaluated with the chi-square goodness-of-fit test is whether 
or not there is a difference between the observed frequencies of the k cells and their expected 
frequencies (also referred to as the theoretical frequencies). The expected frequency of a cell 
is determined through the use of probability theory or is based on some preexisting empirical 
information about the variable under study. If the result of the chi-square goodness-of-fit test 
is significant, the researcher can conclude that in the underlying population represented by the 
sample there is a high likelihood that the observed frequency for at least one of the k cells is not 
equal to the expected frequency of the cell. It should be noted that, in actuality, the test statistic 
for the chi-square goodness-of-fit test provides an approximation of a binomially distributed 
variable (when k = 2) and a multinomially distributed variable (when k > 2). The larger the value 
of n, the more accurate the chi-square approximation of the binomial and multinomial 
distributions.² 

The chi-square goodness-of-fit test is based on the following assumptions: a) Categorical/ 
nominal data are employed in the analysis. This assumption reflects the fact that the test data 
should represent frequencies for k mutually exclusive categories; b) The data that are evaluated 
consist of a random sample of n independent observations. This assumption reflects the fact that 
each observation can only be represented once in the data; and c) The expected frequency of each 
cell is 5 or greater. When this assumption is violated, it is recommended that if k = 2, the 
binomial sign test for a single sample (Test 9) be employed to evaluate the data. When the 
expected frequency of one or more cells is less than 5 and k > 2, the multinomial distribution 
should be employed to evaluate the data. The reader should be aware of the fact that sources are 
not in agreement with respect to the minimum acceptable value for an expected frequency. Many 
sources employ criteria suggested by Cochran (1952), who stated that none of the expected 
frequencies should be less than 1 and that no more than 20% of the expected frequencies should 
be less than 5. However, many sources suggest the latter criteria may be overly conservative. 
In the event that a researcher believes that one or more expected cell frequencies are too small, 
two or more cells can be combined with one another to increase the values of the expected fre- 
quencies. The latter procedure is demonstrated and discussed in Section VI. 
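
The criteria attributed above to Cochran (1952) can be expressed as a simple check. The 
sketch below is illustrative only: the function name is hypothetical, and the rule encoded is 
the one summarized in this section (no expected frequency below 1, and no more than 20% of the 
expected frequencies below 5), which, as noted, some sources consider overly conservative. 

```python
def meets_cochran_criteria(expected):
    """Check expected cell frequencies against the criteria summarized in the text."""
    if any(e < 1 for e in expected):
        return False  # no expected frequency may be less than 1
    share_below_5 = sum(1 for e in expected if e < 5) / len(expected)
    return share_below_5 <= 0.20  # at most 20% of the cells may fall below 5


# Six cells with an expected frequency of 20 each satisfy the criteria
uniform_ok = meets_cochran_criteria([20, 20, 20, 20, 20, 20])
# Hypothetical five-cell layout in which 2 of 5 cells (40%) fall below 5
sparse_ok = meets_cochran_criteria([4, 4, 10, 10, 10])
print(uniform_ok, sparse_ok)
```

When the check fails, combining adjacent cells and recomputing the expected frequencies is one 
way of proceeding. 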

Zar (1999, p. 470) provides an interesting discussion on the issue of the lowest acceptable 
value for an expected frequency. Within the framework of his discussion, Zar (1999) cites 
studies indicating that when the chi-square goodness-of-fit test is employed to evaluate a 
hypothesis regarding a uniform distribution, the test is extremely robust. A robust test is one 
that still provides reliable information in spite of the fact that one or more of its assumptions have 
been violated. A uniform distribution (also referred to as a rectangular distribution) is one 
in which each of the possible values a variable can assume has an equal likelihood of occurring. 
In the case of an analysis involving the chi-square goodness-of-fit test, a distribution is uniform 
if each of the cells has the same expected frequency. 


II. Examples 


Two examples will be employed to illustrate the use of the chi-square goodness-of-fit test. 
Since the two examples employ identical data, they will result in the same conclusions with 
respect to the null hypothesis. 


Example 8.1 A die is rolled 120 times in order to determine whether or not it is fair (unbiased). 
The value 1 appears 20 times, the value 2 appears 14 times, the value 3 appears 18 times, the 
value 4 appears 17 times, the value 5 appears 22 times, and the value 6 appears 29 times. Do 
the data suggest that the die is biased? 


Example 8.2 A librarian wishes to determine if it is equally likely that a person will take a book 
out of the library each of the six days of the week the library is open (assume the library is closed 
on Sundays). She records the number of books signed out of the library during one week and 
obtains the following frequencies: Monday, 20; Tuesday, 14; Wednesday, 18; Thursday, 17; 
Friday, 22; and Saturday, 29. Assume no person is permitted to take out more than one book 
during the week. Do the data indicate there is a difference with respect to the number of books 
taken out on different days of the week? 


III. Null versus Alternative Hypotheses 


In the statement of the null and alternative hypotheses, the lower case Greek letter omicron (ο) 
is employed to represent the observed frequency of a cell in the underlying population, and the 
lower case Greek letter epsilon (ε) is employed to represent the expected frequency of the cell 
in the population. Thus, ο_i and ε_i, respectively, represent the observed and expected frequency 
of the i-th cell in the underlying population. With respect to the observed and expected 
frequencies for the sample data, the notation O_i is employed to represent the observed frequency 
of a cell, and E_i the expected frequency of a cell. 


Null hypothesis H_0: ο_i = ε_i for all cells. 


(In the underlying population the sample represents, for each of the k cells, the observed fre- 
quency of a cell is equal to the expected frequency of the cell. With respect to the sample data 
this leads to the prediction that for all k cells O_i = E_i.) 


Alternative hypothesis H_1: ο_i ≠ ε_i for at least one cell. 


(In the underlying population the sample represents, for at least one of the k cells the observed 
frequency of a cell is not equal to the expected frequency of the cell. With respect to the sample 
data this leads to the prediction that for at least one cell O_i ≠ E_i. The reader should take note 
of the fact that the alternative hypothesis does not state that in order to reject the null hypothesis 
there must be a discrepancy between the observed and expected frequencies of all k cells. 
Rejection of the null hypothesis can be the result of a discrepancy between the observed and 
expected frequencies for one cell, two cells, ..., (k — 1) cells, or all k cells. As a general rule, 
sources always state the alternative hypothesis for the chi-square goodness-of-fit test 
nondirectionally. Although the latter protocol will be adhered to in this book, in actuality it is 
possible to state the alternative hypothesis directionally. The issue of the directionality of 
an alternative hypothesis is discussed in Section VII.) 


IV. Test Computations 
Table 8.2 summarizes the data and computations for Examples 8.1 and 8.2. 


Table 8.2 Chi-Square Summary Table for Examples 8.1 and 8.2 


Cell             O_i          E_i          (O_i - E_i)         (O_i - E_i)²     (O_i - E_i)²/E_i 
1/Monday          20           20               0                    0               0 
2/Tuesday         14           20              -6                   36               1.8 
3/Wednesday       18           20              -2                    4                .2 
4/Thursday        17           20              -3                    9                .45 
5/Friday          22           20               2                    4                .2 
6/Saturday        29           20               9                   81               4.05 
             ΣO_i = 120   ΣE_i = 120   Σ(O_i - E_i) = 0                           χ² = 6.7 


In Table 8.2, the observed frequency of each cell (O_i) is listed in Column 2, and the 
expected frequency of each cell (E_i) is listed in Column 3. The computations for the chi-square 
goodness-of-fit test require that the observed and expected cell frequencies be compared with 
one another. In order to determine the expected frequency of a cell one must either: a) Employ 
the appropriate theoretical probability for the test model; or b) Employ a probability that is based 
on existing empirical data. 

In Examples 8.1 and 8.2, computation of the expected cell frequencies is based on the 
theoretical probabilities for the test model.³ Specifically, if the die employed in Example 8.1 is 
fair, it is equally likely that in a given trial any one of the six face values will appear. Thus, it 
follows that each of the six face values should occur one-sixth of the time. The probability 
associated with each of the possible outcomes (represented by the notation π, which is the lower 
case Greek letter pi) can be computed as follows: π = r/k (where: r represents the number of 
outcomes that will allow an observation to be placed in a specific category, and k represents the 
total number of possible outcomes in any trial). Since, in each trial only one face value will 
result in an observation being assigned to any one of the six categories, the value of the 
numerator for each of the six categories will equal 1. Since in each trial there are six possible 
outcomes, the value of the denominator for each of the six categories will equal six. Thus, for 
each category, π_i = 1/6.⁴ Note that the sum of the k probabilities must equal 1, since if the 
value 1/6 is added six times it sums to 1 (i.e., Σπ_i = 1). 

The same logic employed for Example 8.1 can be applied to Example 8.2. If it is equally 
likely that a person will take a book out of the library on any one of the six days of the week the 
library is open, it is logical to predict that on each day of the week one-sixth of the books will 
be taken out. Consequently, the value 1/6 will represent the expected probability for each of the 
six cells in Example 8.2. The expected frequency of each cell in Examples 8.1 and 8.2 is 
computed by multiplying the total number of observations by the probability associated with the 
cell. Equation 8.1 summarizes the computation of an expected frequency. 


\[ E_i = (n)(\pi_i) \]    (Equation 8.1) 


Where: n represents the total number of observations 
π_i represents the probability that an observation will fall within the i-th cell 


Since in both Example 8.1 and 8.2 the total number of observations is n = 120, the expected 
frequency for each cell can be computed as follows: E_i = (120)(1/6) = 20. 
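
The computation summarized by Equation 8.1 can be sketched in Python as shown below. The 
function name is hypothetical, and exact fractions are used only so that the check that the 
probabilities sum to 1 is not affected by rounding. 

```python
from fractions import Fraction


def expected_frequencies(n, probabilities):
    """Multiply the total number of observations by each cell probability."""
    if sum(probabilities) != 1:
        raise ValueError("the k cell probabilities must sum to 1")
    return [n * p for p in probabilities]


# Examples 8.1 and 8.2: six equally likely cells and n = 120 observations
cell_probabilities = [Fraction(1, 6)] * 6
expected = expected_frequencies(120, cell_probabilities)
print([int(e) for e in expected])
```

The same function applies when the cell probabilities are unequal, provided they sum to 1. 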

Upon determining the expected cell frequencies, Equation 8.2 is employed to compute the 
test statistic for the chi-square goodness-of-fit test. 


\[ \chi^2 = \sum_{i=1}^{k} \left[ \frac{(O_i - E_i)^2}{E_i} \right] \]    (Equation 8.2) 


The operations described by Equation 8.2 are as follows: a) The expected frequency of 
each cell is subtracted from its observed frequency. This is summarized in Column 4 of Table 
8.2; b) For each cell, the difference between the observed and expected frequency is squared. 
This is summarized in Column 5 of Table 8.2; c) For each cell, the squared difference between 
the observed and expected frequency is divided by the expected frequency of the cell. This is 
summarized in Column 6 of Table 8.2; and d) The value of chi-square is computed by summing 
all of the values in Column 6. For both Examples 8.1 and 8.2, Equation 8.2 yields the value 
χ² = 6.7. 
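
The four operations described for Equation 8.2 can be sketched in Python as follows. The 
function name is hypothetical, and the data are the observed and expected frequencies of 
Examples 8.1 and 8.2. 

```python
import math


def chi_square_gof(observed, expected):
    """Sum (O_i - E_i)^2 / E_i over all cells."""
    if not math.isclose(sum(observed), sum(expected)):
        raise ValueError("observed and expected frequencies must have equal sums")
    total = 0.0
    for o, e in zip(observed, expected):
        difference = o - e          # Column 4 of Table 8.2
        squared = difference ** 2   # Column 5
        total += squared / e        # Column 6, accumulated into chi-square
    return total


observed = [20, 14, 18, 17, 22, 29]
expected = [20, 20, 20, 20, 20, 20]
chi_sq = chi_square_gof(observed, expected)
print(round(chi_sq, 2))
```

The check on the two sums mirrors the requirement that the observed and expected frequencies 
have identical totals. 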

Note that in Table 8.2 the sums of the observed and expected frequencies are identical. This 
must always be the case, and any time these sums are not equivalent it indicates that a 
computational error has been made.⁵ It is also required that the sum of the differences between the 
observed and expected frequencies equals zero (i.e., Σ(O_i - E_i) = 0). Any time the latter value 
does not equal zero, it indicates an error has been made. Since all of the (O_i - E_i) values are 
squared in Column 5, the sum of Column 6, which represents the value of χ², can never be 
a negative number. If a negative value is obtained for chi-square, it indicates an error has been 
made. The only time χ² will equal zero is when O_i = E_i for all cells. 


V. Interpretation of the Test Results 


The obtained value χ² = 6.7 is evaluated with Table A4 (Table of the Chi-Square 
Distribution) in the Appendix. A general overview of the chi-square distribution and guidelines 
for interpreting the values in Table A4 can be found in Sections I and V of the single-sample 
chi-square test for a population variance (Test 3). 

The degrees of freedom that are employed in evaluating the results of the chi-square 
goodness-of-fit test are computed with Equation 8.3. 


df=k-1 (Equation 8.3) 


When Table A4 is employed to evaluate a chi-square value computed for the chi-square 
goodness-of-fit test, the following protocol is employed. The tabled critical values for the 
chi-square goodness-of-fit test are always derived from the right tail of the distribution. Thus, the 
tabled critical .05 chi-square value (to be designated χ²_.05) will be the tabled chi-square value at 
the 95th percentile. In the same respect, the tabled critical .01 chi-square value (to be designated 
χ²_.01) will be the tabled chi-square value at the 99th percentile. The general rule is that the tabled 
critical chi-square value for a given level of alpha will be the tabled chi-square value at the 
percentile that corresponds to the value of (1 — a). In order to reject the null hypothesis, the 
obtained value of chi-square must be equal to or greater than the tabled critical value at the pre- 
specified level of significance. The aforementioned guidelines for determining tabled critical 
chi-square values are employed when the alternative hypothesis is stated nondirectionally (which, 
as noted earlier, is usually the case). The determination of tabled critical chi-square values in 
reference to a directional alternative hypothesis is discussed in Section VII. 

Applying the guidelines for a nondirectional analysis to Examples 8.1 and 8.2, the degrees 
of freedom are computed to be df = 6 - 1 = 5. The tabled critical .05 chi-square value for df = 5 
is χ²_.05 = 11.07, which, as noted above, is the tabled chi-square value at the 95th percentile. The 
tabled critical .01 chi-square value for df = 5 is χ²_.01 = 15.09, which, as noted above, is the 
tabled chi-square value at the 99th percentile. Since the computed value χ² = 6.7 is less than 
χ²_.05 = 11.07, the null hypothesis cannot be rejected at the .05 level. This result can be 
summarized as follows: χ²(5) = 6.7, p > .05. Although there are some deviations between the 
observed and expected frequencies in Table 8.2, the result of the chi-square goodness-of-fit test 
indicates there is a reasonably high likelihood that the deviations in the sample data can be 
attributed to chance. 
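
The decision rule described in this section can be sketched in Python as follows. The 
function name is hypothetical, and the critical values are entered by hand from the values 
quoted above for df = 5 rather than computed. 

```python
def evaluate_chi_square(chi_sq, critical_values):
    """Return, for each alpha level, whether the null hypothesis is rejected.

    critical_values maps an alpha level to its tabled critical chi-square value.
    """
    # Reject when the obtained value is equal to or greater than the tabled value
    return {alpha: chi_sq >= critical
            for alpha, critical in critical_values.items()}


k = 6                 # number of cells in Examples 8.1 and 8.2
df = k - 1            # Equation 8.3
tabled_df5 = {0.05: 11.07, 0.01: 15.09}  # values quoted in the text for df = 5
decision = evaluate_chi_square(6.7, tabled_df5)
print(df, decision)
```

For a different number of cells the tabled critical values would have to be looked up again 
for the new degrees of freedom. 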

A summary of the analysis of Examples 8.1 and 8.2 with the chi-square goodness-of-fit 
test follows: a) In Example 8.1 the data do not suggest that the die is biased; and b) In Example 
8.2 the data do not suggest that there is any difference with respect to the number of books that 
are taken out of the library on different days of the week. 


VI. Additional Analytical Procedures for the Chi-Square 
Goodness-of-Fit Test and/or Related Tests 


1. Comparisons involving individual cells when k > 2 Within the framework of the chi- 
square goodness-of-fit test it is possible to compare individual cells with one another. To 
illustrate this, assume that we wish to address the following questions in reference to Examples 
8.1 and 8.2. 

a) In Example 8.1, is the observed frequency of 29 for the face value 6 higher than the 
combined observed frequency of the other five face values? Note that this is not the same thing 
as asking whether the face value 6 is more likely to occur when compared individually with any 
of the other five face values. In order to answer the latter question, the observed frequency for 
the face value 6 must be contrasted with the observed frequency for the specific face value in 
which one is interested. 

b) In Example 8.2, is the observed frequency of 29 books for Saturday higher than the 
combined observed frequency of the other five days of the week? Note that this is not the same 
thing as asking whether a person is more likely to take a book out of the library on Saturday 
when compared individually with any one of the other five days of the week. In order to answer 
the latter question, the observed frequency for Saturday must be contrasted with the observed 
frequency of the specific day of the week in which one is interested. 

In order to answer the question of whether 6/Saturday occurs a disproportionate amount 
of the time, the observed frequency for 6/Saturday must be contrasted with the combined 
observed frequencies of the other five face values/days of the week. In order to do this, the 
original six-cell chi-square table is collapsed into a two-cell table, with one cell representing 
6/Saturday (Cell 1) and the other cell representing 1, 2, 3, 4, 5/M, T, W, Th, F (Cell 2). The 
expected probability for Cell 1 remains $\pi_1 = 1/6$, since if we are dealing with a random process, 
there is still a one in six chance that in any trial the face value 6 will occur, or that a person will 
take a book out of the library on Saturday. Thus: $E_1 = (120)(1/6) = 20$. The expected 
frequency of Cell 2 is computed as follows: $E_2 = (120)(5/6) = 100$. Note that the probability 
$\pi_2 = 5/6$ for Cell 2 is the sum of the probabilities of the other five cells. In other words, if it is 
randomly determined what face value appears on the die or on what day of the week a person 
takes a book out of the library, there is a five in six chance that a face value other than 6 will 
appear on any roll of the die, and a five in six chance that a book is taken out of the library on 
a day of the week other than Saturday. Table 8.3 summarizes the data for the problem under 
discussion. 


Table 8.3 Chi-Square Summary Table When $\pi_1 = 1/6$ and $\pi_2 = 5/6$ 


Cell                            $O_i$    $E_i$    $(O_i - E_i)$    $(O_i - E_i)^2$    $(O_i - E_i)^2 / E_i$ 
6/Saturday                        29       20           9                81                 4.05 
1, 2, 3, 4, 5/M, T, W, Th, F      91      100          -9                81                  .81 
                                $\sum O_i = 120$   $\sum E_i = 120$   $\sum (O_i - E_i) = 0$   $\chi^2 = 4.86$ 


Since there are k = 2 cells, df = 2 - 1 = 1. Employing Table A4 for df = 1, $\chi^2_{.05} = 3.84$ and 
$\chi^2_{.01} = 6.63$. Since the obtained value $\chi^2 = 4.86$ is larger than $\chi^2_{.05} = 3.84$, the null hypothesis 
can be rejected at the .05 level (i.e., $\chi^2(1) = 4.86$, p < .05). The null hypothesis cannot be 
rejected at the .01 level since $\chi^2 = 4.86 < \chi^2_{.01} = 6.63$. Note that by stating the problem in 
reference to one face value or one day of the week, the researcher is able to reject the null 
hypothesis at the .05 level. Recollect that the analysis in Section V does not allow the researcher 
to reject the null hypothesis. 
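
The arithmetic of the collapsed two-cell analysis can be checked with a short script. What follows is only a minimal Python sketch: the function name `chi_square_gof` is an illustrative assumption (it does not appear in the Handbook), and the critical values are simply the df = 1 entries quoted from Table A4 above. 

```python
def chi_square_gof(observed, probs):
    """Return (chi-square, df) when every cell probability is specified in advance."""
    n = sum(observed)
    expected = [n * p for p in probs]  # E_i = n * pi_i
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi_sq, len(observed) - 1   # df = k - 1


# Cell 1 = 6/Saturday, Cell 2 = the other five cells combined (Table 8.3).
chi_sq, df = chi_square_gof([29, 91], [1 / 6, 5 / 6])

CRITICAL_05, CRITICAL_01 = 3.84, 6.63  # tabled critical values for df = 1
reject_at_05 = chi_sq > CRITICAL_05
reject_at_01 = chi_sq > CRITICAL_01
print(f"chi-square({df}) = {chi_sq:.2f}; reject at .05: {reject_at_05}; at .01: {reject_at_01}")
```

The same function can be reused for any other regrouping of the six cells by passing the collapsed frequencies together with their summed probabilities. 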

If the original null hypothesis a researcher intends to study deals with the frequency of Cell 
6/Saturday versus the other five cells, the researcher is not obliged to defend the analysis 
described above. However, let us assume that the original null hypothesis under study is the one 
stipulated in Section III. Let us also assume that upon evaluating the data, the null hypothesis 
cannot be rejected. Because of this the researcher then decides to reconceptualize the problem 
as summarized in Table 8.3. To go even further, the researcher can extend the type of analysis 
depicted in Table 8.3 to all six cells (i.e., compare the observed frequency of each of the cells with 
the combined observed frequency of the other five cells; e.g., Cell 1/Monday versus Cells 
2, 3, 4, 5, 6/T, W, Th, F, S, as well as Cell 2/Tuesday versus Cells 1, 3, 4, 5, 6/M, W, Th, F, S, 
and so on for the other three cells). If $\alpha = .05$ is employed for each of the six comparisons, the 
overall likelihood of committing at least one Type I error within the set of six comparisons will 
be substantially above .05 (to be exact, it will equal $1 - (1 - .05)^6 = .26$). If within the set 
of six comparisons the researcher does not want more than a 5% chance of committing a Type 
I error, it is required that the alpha level employed for each comparison be adjusted. Specifically, 
by employing a probability of .05/6 = .0083 per comparison, the researcher will ensure that the 
overall Type I error rate will not exceed 5%. It should be noted, however, that by employing a 
smaller alpha level per comparison, the researcher is reducing the power associated with each 
comparison. A detailed discussion of the protocol for adjusting the alpha level when conducting 
multiple comparisons can be found in Section VI of the single-factor between-subjects analysis 
of variance (Test 21). 
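
The two quantities in the preceding paragraph are easy to reproduce. The sketch below is illustrative only: the function names are assumptions, the first function treats the comparisons as independent (which is what the expression 1 - (1 - .05)^6 presupposes), and the per-comparison adjustment shown is the one commonly called the Bonferroni adjustment. 

```python
def familywise_error_rate(alpha_per_comparison, n_comparisons):
    """Probability of at least one Type I error across independent comparisons."""
    return 1 - (1 - alpha_per_comparison) ** n_comparisons


def adjusted_alpha(alpha_overall, n_comparisons):
    """Per-comparison alpha obtained by dividing the overall rate by the number of comparisons."""
    return alpha_overall / n_comparisons


unadjusted_rate = familywise_error_rate(0.05, 6)            # roughly .26
alpha_each = adjusted_alpha(0.05, 6)                        # roughly .0083
adjusted_rate = familywise_error_rate(alpha_each, 6)        # stays at or below .05
print(f"{unadjusted_rate:.2f} {alpha_each:.4f} {adjusted_rate:.4f}")
```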

It should also be noted that a researcher can reduce the number of degrees of freedom 
employed in a chi-square analysis by reconfiguring a table comprised of three or more cells into 
a table comprised of fewer cells. Reduction of the degrees of freedom will increase the 
likelihood of rejecting the null hypothesis, since the lower the value of the degrees of freedom, 
the lower the tabled critical chi-square value at a given level of significance. By employing the 
latter strategy, a researcher may be able to convert a table with three or more cells which does not 
yield a significant result into a smaller table that does yield a significant result. Obviously, it 
would be inappropriate to employ such a strategy if its sole purpose is to milk a significant result 
out of a set of data. Any significant results obtained within the latter context have to be viewed 
with extreme caution, and should be replicated prior to being submitted for publication. 

It is also possible to conduct other comparisons in addition to the ones noted above. For 
example, one can compare the observed frequencies for face values/days of the week 1, 2, 3/M, 
T, W with the observed frequencies for 4, 5, 6/Th, F, S. In such an instance there again will be 
two cells, with a probability of $\pi_i = 1/2$ for each cell (since $\pi_i = 3/6 = 1/2$). A researcher can 
also break down the original six-cell table into three cells (e.g., 1, 2/M, T versus 3, 4/W, Th 
versus 5, 6/F, S). In this instance, the probability for each cell will equal $\pi_i = 1/3$ (since 
$\pi_i = 2/6 = 1/3$). 
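
To make the regrouping concrete, the following Python sketch collapses the six observed frequencies of Examples 8.1 and 8.2 (20, 14, 18, 17, 22, 29) into the two-cell and three-cell arrangements just described. The helper name `collapse_and_test` is hypothetical, and the resulting chi-square values are computed here purely for illustration; they are not reported in the text. 

```python
OBSERVED = [20, 14, 18, 17, 22, 29]  # cells 1-6 (Monday through Saturday)


def collapse_and_test(observed, groups):
    """Sum the observed frequencies within each group of cell indices and
    compute chi-square, assuming every original cell is equally likely."""
    n, k = sum(observed), len(observed)
    collapsed = [sum(observed[i] for i in group) for group in groups]
    expected = [n * len(group) / k for group in groups]
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(collapsed, expected))
    return collapsed, expected, chi_sq


# 1, 2, 3 versus 4, 5, 6 (probability 1/2 per cell, df = 1)
two_cells, two_expected, chi_sq_two = collapse_and_test(OBSERVED, [[0, 1, 2], [3, 4, 5]])

# 1, 2 versus 3, 4 versus 5, 6 (probability 1/3 per cell, df = 2)
three_cells, three_expected, chi_sq_three = collapse_and_test(OBSERVED, [[0, 1], [2, 3], [4, 5]])

print(two_cells, two_expected, round(chi_sq_two, 2))
print(three_cells, three_expected, round(chi_sq_three, 2))
```

As with any comparison of this kind, the alpha level used to evaluate each result depends on whether the comparison was planned and on how many comparisons are conducted. 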

Another type of comparison that can be conducted is to contrast just two of the original six 
cells with one another. Specifically, let us assume we want to compare Cell 1/Monday with Cell 
2/Tuesday. Table 8.4 is employed to summarize the data for such a comparison. 


Table 8.4 Chi-Square Summary Table for Comparison 


Cell          $O_i$    $E_i$    $(O_i - E_i)$    $(O_i - E_i)^2$    $(O_i - E_i)^2 / E_i$ 
1/Monday        20       17           3                9                  .53 
2/Tuesday       14       17          -3                9                  .53 
              $\sum O_i = 34$   $\sum E_i = 34$   $\sum (O_i - E_i) = 0$   $\chi^2 = 1.06$ 


Note that in the above example, since we employ only two cells, the probability for each 
cell will be $\pi_i = 1/2$. The expected frequency of each cell is obtained by multiplying $\pi_i = 1/2$ 
by the total number of observations in the two cells (which equals 34). As noted previously, in 
conducting a comparison such as the one above, a critical issue the researcher must address is 
what value of alpha to employ in evaluating the null hypothesis. If $\alpha = .05$ is used, for df = 1, 
$\chi^2_{.05} = 3.84$. The null hypothesis cannot be rejected, since the obtained value $\chi^2 = 1.06$ 
is less than $\chi^2_{.05} = 3.84$. 

A major point that has been emphasized throughout the discussion in this section is that, 
depending upon how one initially conceptualizes a problem, there will generally be a number of 
different ways in which a set of data can be analyzed. Furthermore, after analyzing the full set 
of data, additional comparisons involving two or more categories can be conducted. The various 
types of comparisons that one can conduct can either be planned or unplanned. The term 
planned comparison is employed throughout the book to refer to a comparison that is planned 
prior to the data collection phase of a study. In contrast, an unplanned comparison is one that 
a researcher decides to conduct after the experimental data have been collected and scrutinized. 
A problem associated with unplanned comparisons is that in a large body of data there are a 
potentially large number of comparisons that can be conducted. Consequently, a researcher can 
conduct many comparisons until one or more of them yield a significant result. The latter 
strategy can thus be employed to milk significant results out of a large body of data. It was noted 
earlier in this section that the larger the number of comparisons one conducts, the greater the 
likelihood that any significant result obtained for a given comparison will be a Type I error (as 
opposed to a genuine difference that can be reliably replicated). 

Whenever possible, comparisons should be planned, and most sources take the position that 
when a researcher plans a limited number of comparisons before the data collection phase of a 
study, one is not obliged to control the overall Type I error rate. However, when comparisons 
are not planned, most sources believe that some adjustment of the Type I error rate should be 
made in order to avoid inflating it excessively. As noted earlier in this section, one way of 
achieving the latter is to divide the maximum overall Type I error rate one is willing to tolerate 
by the total number of comparisons one conducts. The resulting probability value will represent 
the alpha level employed in evaluating each of the comparisons. A comprehensive discussion 
of the subject of comparisons (which is also germane to the issue of alternate ways of 
conceptualizing a set of data) can be found in Section VI of the single-factor between-subjects 
analysis of variance. 

In closing the discussion of comparisons for the chi-square goodness-of-fit test, it should 
be noted that some sources present alternative comparison procedures that may yield results that 
are not in total agreement with those obtained in this section. In instances where different meth- 
odologies yield substantially different results (which will usually not be the case), a replication 
study evaluating the same hypothesis is in order. As noted throughout the book, replication is 
the most effective way to demonstrate the validity of a hypothesis. Obviously, the use of large 
sample sizes in both original and replication studies further increases the likelihood of obtaining 
reliable results. An alternative approach for conducting comparisons is presented in the next 
section. 


2. The analysis of standardized residuals An alternative procedure for conducting compar- 
isons (developed by Haberman (1973) and cited in sources such as Siegel and Castellan (1988)) 
involves the computation of standardized residuals. By computing the latter values, one is able 
to determine which cells are the major contributors to a significant chi-square value. Equation 
8.4 is employed to compute a standardized residual ($R_i$) for each cell in a chi-square table. 


\[ R_i = \frac{O_i - E_i}{\sqrt{E_i}} \qquad \text{(Equation 8.4)} \]


A value computed for a residual (which is interpreted as a normally distributed variable) 
is evaluated with Table A1 (Table of the Normal Distribution) in the Appendix. Any residual 
with an absolute value that is equal to or greater than the tabled critical two-tailed .05 value 
$z_{.05} = 1.96$ is significant at the .05 level. Any residual with an absolute value that is equal to 
or greater than the tabled critical two-tailed .01 value $z_{.01} = 2.58$ is significant at the .01 level. 
Any cell in a chi-square table which has a significant residual makes a significant contribution to 
the obtained chi-square value. For any cell that has a significant residual, one can conclude that 
the observed frequency of the cell differs significantly from its expected frequency. The sign of 
the standardized residual indicates whether the observed frequency of the cell is above (+) or 
below (-) the expected frequency. The sum of the squared residuals for all k cells will equal the 
obtained value of chi-square. Although the result of the chi-square analysis for Examples 8.1 and 
8.2 is not significant, the standardized residuals for the chi-square table are computed and 
summarized in Table 8.5. 


Table 8.5 Analysis of Residuals for Examples 8.1 and 8.2 


Cell           $O_i$    $E_i$    $(O_i - E_i)$    $R_i = (O_i - E_i)/\sqrt{E_i}$    $R_i^2 = (O_i - E_i)^2/E_i$ 
1/Monday         20       20           0                    0                              0 
2/Tuesday        14       20          -6                -1.34                           1.80 
3/Wednesday      18       20          -2                 -.45                            .20 
4/Thursday       17       20          -3                 -.67                            .45 
5/Friday         22       20           2                  .45                            .20 
6/Saturday       29       20           9                 2.01                           4.05 
               $\sum O_i = 120$   $\sum E_i = 120$   $\sum (O_i - E_i) = 0$                     $\chi^2 = 6.7$ 


Note that the only cell with a standardized residual with an absolute value above 1.96 is 
Cell 6/Saturday. Thus, one can conclude that the observed frequency of Cell 6/Saturday is 
significantly above its expected frequency and, as such, the cell would be viewed as a major 
contributor in obtaining a significant chi-square value (if, in fact, the computed chi-square value 
had been significant). It should be noted that this result is consistent with the first comparison 
that was conducted in the previous section, since the latter comparison indicates that the observed 
frequency of 29 for Cell 6/Saturday deviates significantly from its expected frequency, when 
the cell is contrasted with the combined frequencies of the other five cells. 
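
The residual analysis lends itself to a brief computational check. The following is a minimal Python sketch of Equation 8.4 applied to the frequencies of Examples 8.1 and 8.2; the function and variable names are illustrative assumptions rather than anything prescribed by the sources cited above. 

```python
import math


def standardized_residuals(observed, expected):
    """R_i = (O_i - E_i) / sqrt(E_i) for each cell (Equation 8.4)."""
    return [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]


cells = ["1/Monday", "2/Tuesday", "3/Wednesday", "4/Thursday", "5/Friday", "6/Saturday"]
observed = [20, 14, 18, 17, 22, 29]
expected = [20] * 6

residuals = standardized_residuals(observed, expected)

Z_05 = 1.96  # tabled critical two-tailed .05 value
significant_cells = [c for c, r in zip(cells, residuals) if abs(r) >= Z_05]

# The squared residuals sum to the chi-square value for the full table.
chi_sq = sum(r ** 2 for r in residuals)

for cell, r in zip(cells, residuals):
    print(f"{cell:12s} R = {r:5.2f}  R^2 = {r ** 2:4.2f}")
print(significant_cells, round(chi_sq, 2))
```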


3. Computation of a confidence interval for the chi-square goodness-of-fit test The pro- 
cedure to be described in this section allows one to compute a confidence interval for the 
proportion of cases in the underlying population that falls within any cell in a one-dimensional 
chi-square table. The true population proportion for a cell will be represented by the notation 
$\pi_1$. The procedure to be described below is a large sample approximation of the confidence 
interval for a binomially distributed variable (which applies to the chi-square goodness-of-fit 
test model when k = 2). The analysis to be described in this section will assume that if k > 2, the 
original chi-square table is converted into a table consisting of two cells. 

Equation 8.5 is the general equation for computing a confidence interval for a population 
proportion for a specific cell, when there are k = 2 cells. 


\[ p_1 - z_{\alpha/2}\sqrt{\frac{p_1 p_2}{n}} \;\le\; \pi_1 \;\le\; p_1 + z_{\alpha/2}\sqrt{\frac{p_1 p_2}{n}} \qquad \text{(Equation 8.5)} \]


Where: $p_1$ represents the proportion of observations in Cell 1. In the analysis under 
discussion, Cell 1 will represent the single cell whose observed frequency is being compared 
with the combined observed frequencies of the remaining five cells. The value of $p_1$ 
is computed by dividing the number of observations in Cell 1 (which will be 
represented by the notation x) by n (which represents the total number of 
observations). Thus, $p_1 = x/n$. 

$p_2 = 1 - p_1$. The value $p_2$ represents the proportion of observations in Cell 2. In the 
analysis under discussion, Cell 2 will represent the combined frequencies of the other 
five cells. $p_2$ can be computed by dividing the number of observations that are not 
in Cell 1 by the total number of observations. Thus, $p_2 = (n - x)/n$. 

$z_{\alpha/2}$ represents the tabled critical value in the normal distribution below which a 
proportion (percentage) equal to $[1 - (\alpha/2)]$ of the cases falls. If the proportion 
(percentage) of the distribution that falls within the confidence interval is subtracted 
from 1 (100%), it will equal the value of $\alpha$. 


If one wants to determine the 95% confidence interval, the tabled critical two-tailed .05 
value $z_{.05} = 1.96$ is employed in Equation 8.5. The tabled critical two-tailed .01 value 
$z_{.01} = 2.58$ is employed to compute the 99% confidence interval. The value $\sqrt{(p_1 p_2)/n}$ in 
Equation 8.5 represents the estimated standard error of the population proportion. The latter 
value is an estimated standard deviation of a sampling distribution of a proportion. 

If (as is done in Table 8.3) the data for Examples 8.1 and 8.2 are expressed in a format 
consisting of two cells, Equation 8.5 can be employed to compute a confidence interval for each 
of the six cells. Thus, if we wish to compute a confidence interval for Cell 6/Saturday we 
can determine that $p_1 = x/n = 29/120 = .242$ and $p_2 = (n - x)/n = (120 - 29)/120 = .758$. 
Substituting the latter values and the value $z_{.05} = 1.96$ in Equation 8.5, the 95% confidence 
interval is computed below. 


\[ .242 - (1.96)\sqrt{\frac{(.242)(.758)}{120}} \;\le\; \pi_1 \;\le\; .242 + (1.96)\sqrt{\frac{(.242)(.758)}{120}} \]

\[ \pi_1 = .242 \pm .077 \]

\[ .165 \le \pi_1 \le .319 \]


Thus, the researcher can be 95% confident that the true proportion of cases in the 
underlying population that falls in Cell 6/Saturday is a value between .165 and .319. Stated in 
probabilistic terms, there is a probability/likelihood of .95 that the true value of the population 
proportion falls within the range .165 to .319. 

The 99% confidence interval, which has a larger range, is computed below by employing 
$z_{.01} = 2.58$ in Equation 8.5. 


\[ .242 - (2.58)\sqrt{\frac{(.242)(.758)}{120}} \;\le\; \pi_1 \;\le\; .242 + (2.58)\sqrt{\frac{(.242)(.758)}{120}} \]

\[ \pi_1 = .242 \pm .101 \]

\[ .141 \le \pi_1 \le .343 \]


Thus, the researcher can be 99% confident that the true proportion of cases in the 
underlying population that falls in Cell 6/Saturday is a value between .141 and .343. Stated in 
probabilistic terms, there is a probability/likelihood of .99 that the true value of the population 
proportion falls within the range .141 to .343. 
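
The two intervals can be reproduced with a few lines of code. The sketch below is a minimal illustration of Equation 8.5; the function name is an assumption, and because it carries the unrounded proportion 29/120 through the calculation, its limits can differ from the hand-computed values (which start from the rounded .242) in the third decimal place. 

```python
import math


def proportion_confidence_interval(x, n, z):
    """Large-sample confidence interval for a proportion (Equation 8.5)."""
    p1 = x / n                                # proportion of observations in Cell 1
    p2 = 1 - p1                               # proportion of observations in Cell 2
    margin = z * math.sqrt(p1 * p2 / n)       # z times the estimated standard error
    return p1 - margin, p1 + margin


ci_95 = proportion_confidence_interval(29, 120, 1.96)  # Cell 6/Saturday, 95%
ci_99 = proportion_confidence_interval(29, 120, 2.58)  # Cell 6/Saturday, 99%

print("95%%: %.3f to %.3f" % ci_95)
print("99%%: %.3f to %.3f" % ci_99)
```

Calling the function with the observed frequency of any other cell (for instance x = 14 for Cell 2/Tuesday) yields the corresponding interval for that cell. 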

The above described procedure can be repeated for the other five cells in Examples 8.1 and 
8.2. In each instance the observed frequency of a cell is evaluated in relation to the combined 
observed frequencies of the remaining five cells. Zar (1999, pp. 527-530) describes alternative 
procedures for computing a confidence interval for a binomially distributed variable. 


4. Brief discussion of the z test for a population proportion (Test 9a) and the single-sample 
test for the median (Test 9b) In Section I it is noted that when k = 2 and the value of n is large, 
the chi-square goodness-of-fit test provides a good approximation of the binomial distribution. 
Under the discussion of the binomial sign test for a single sample, two tests are described 
which yield equivalent results to those obtained with the chi-square goodness-of-fit test when 
k = 2. The two tests are the z test for a population proportion and the single-sample test for 
the median. In the latter test, the two cells of the chi-square table are comprised of scores that 
fall above the median of a specific distribution and scores that fall below the median of the dis- 
tribution. For a full discussion of these tests, the reader should consult the discussion of the 
binomial sign test for a single sample. 


5. The correction for continuity for the chi-square goodness-of-fit test Although it is not 
generally discussed in reference to the chi-square goodness-of-fit test, a correction for con- 
tinuity (which is discussed under the Wilcoxon signed-ranks test (Test 6)) can be applied to 
Equation 8.2. The basis for employing the correction for continuity with the chi-square 
goodness-of-fit test is that the test employs a continuous distribution to approximate a discrete 
distribution (specifically, the binomial or multinomial distributions). The correction for con- 
tinuity is based on the premise that if a continuous distribution is employed to estimate a discrete 
distribution, such an approximation will inflate the Type I error rate. By employing the 
correction for continuity the Type I error rate is ostensibly adjusted to be more compatible with 
the prespecified alpha value designated by the researcher. Equation 8.6 is the continuity- 
corrected chi-square equation for the chi-square goodness-of-fit test. 


\[ \chi^2 = \sum_{i=1}^{k} \frac{(|O_i - E_i| - .5)^2}{E_i} \qquad \text{(Equation 8.6)} \]


Note that by subtracting .5 from the absolute value of the difference between each set of 
observed and expected frequencies, the chi-square value derived with Equation 8.6 will be lower 
than the value computed with Equation 8.2. The magnitude of the correction for continuity will 
be inversely related to the size of the sample. The correction for continuity for the chi-square 
goodness-of-fit test is only employed when there are k = 2 cells. This latter application of the 
correction is discussed under the z test for a population proportion. The use of the correction 
for continuity with other designs that employ the chi-square statistic is discussed under the chi- 
square test for r x c tables (Test 16). 
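
To show the effect of the correction, the following Python sketch applies Equation 8.6 to the two-cell data of Table 8.3. The function names are illustrative assumptions, and the corrected value it produces is computed here as an illustration; it is not a result reported in the text. 

```python
def chi_square(observed, expected):
    """Uncorrected chi-square (Equation 8.2)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_corrected(observed, expected):
    """Continuity-corrected chi-square (Equation 8.6); intended for k = 2 cells."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))


observed = [29, 91]    # 6/Saturday versus the other five cells combined
expected = [20, 100]

uncorrected = chi_square(observed, expected)
corrected = chi_square_corrected(observed, expected)
print(round(uncorrected, 2), round(corrected, 3))
```

The corrected value is smaller than the uncorrected one, as the discussion above indicates it must be, and the gap between the two narrows as the expected frequencies grow. 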


6. Application of the chi-square goodness-of-fit test for assessing goodness-of-fit for a 
theoretical population distribution In analyzing data there are situations when a researcher 
may want to determine whether a distribution of sample data conforms to a specific theoretical 
population (or probability) distribution. As is the case with the Kolmogorov-Smirnov goodness- 
of-fit test for a single sample (Test 7) and the Lilliefors test for normality (Test 7a), the chi- 
square goodness-of-fit test can also be employed for this purpose. Although the Kolmogorov- 
Smirnov and Lilliefors tests are designed to be employed with a continuous variable and the chi- 
square test is designed to be employed with a discrete variable, the latter test is sometimes 
employed to assess goodness-of-fit for a continuous variable. The most common application of the 
chi-square goodness-of-fit test with a continuous variable is in assessing goodness-of-fit for 
a normal distribution, when the population mean and standard deviation have to be estimated 
from the sample data. Although the Kolmogorov-Smirnov test (which stipulates specific values 
for the population mean and standard deviation) and the Lilliefors test for normality (which, 
like the chi-square test, estimates the population mean and standard deviation from the sample 
data) are better suited for the latter purpose, the chi-square test is often used since it requires less 
computation (which in itself is not sufficient justification for employing a test). 

When the chi-square goodness-of-fit test is employed to assess goodness-of-fit for a 
theoretical distribution, Equation 8.3 (i.e., df = k — 1) is not appropriate for computing the 
degrees of freedom. In determining whether a distribution of sample data conforms to a specific 
theoretical distribution (such as the normal, binomial, or Poisson distributions, all of which will 
be or have been discussed at some point in the book), it may be necessary to estimate one or more 
population parameters prior to computing the expected frequency of each cell. In such a case, 
Equation 8.7 is employed to compute the degrees of freedom for the analysis. 


\[ df = k - 1 - w \qquad \text{(Equation 8.7)} \]

Where: w represents the number of parameters that must be estimated 


In actuality, df = k - 1 - w is the generic equation for computing the degrees of freedom for 
the chi-square goodness-of-fit test. Equation 8.3 (df = k - 1), which has been used in the 
examples discussed up to this point, represents the form the equation df = k - 1 - w assumes 
when w = 0. 

Example 8.3 will be employed to demonstrate the use of the chi-square goodness-of-fit 
test in assessing goodness-of-fit for a normal distribution. In point of fact, Example 8.3 is almost 
identical to Example 7.1, which is employed in evaluating the same hypothesis with both the 
Kolmogorov-Smirnov goodness-of-fit test for a single sample and Lilliefors test for normal- 
ity. However, the text of Example 8.3 states that the mean and estimated standard deviation of 
the population are estimated from the sample data (whereas the latter values are stipulated in 
Example 7.1). The values $\bar{X} = 90.07$ and $\tilde{s} = 34.79$ (which are also employed for the 
Lilliefors test) noted in Example 8.3 were computed by employing Equations I.1 and I.8 with 
the 30 scores in the sample. 


Example 8.3 A researcher conducts a study to evaluate whether the distribution of the length 
of time it takes migraine patients to respond to a 100 mg. dose of an intravenously administered 
drug is normal. The amount of time (in seconds) that elapses between the administration of the 
drug and cessation of a headache for 30 migraine patients is recorded below. The 30 scores are 
arranged ordinally (i.e., from fastest response time to slowest response time). 


21, 32, 38, 40, 48, 55, 63, 66, 70, 75, 80, 84, 86, 90, 90, 93, 95, 98, 100, 105, 106, 108, 115, 118, 
126, 128, 130, 142, 145, 155 


The mean and standard deviation of the population are estimated from the sample data to 
be $\bar{X} = 90.07$ and $\tilde{s} = 34.79$. Do the data conform to a normal distribution? 
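
The two estimates quoted in the example can be verified directly from the 30 scores. The sketch below uses Python's standard `statistics` module and assumes that the estimated population standard deviation is the one computed with n - 1 in the denominator, which is the estimate that reproduces the value 34.79. 

```python
import statistics

scores = [21, 32, 38, 40, 48, 55, 63, 66, 70, 75, 80, 84, 86, 90, 90,
          93, 95, 98, 100, 105, 106, 108, 115, 118, 126, 128, 130, 142, 145, 155]

mean = statistics.mean(scores)   # sample mean
sd = statistics.stdev(scores)    # standard deviation with n - 1 in the denominator

print(len(scores), round(mean, 2), round(sd, 2))
```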


In order to employ the chi-square goodness-of-fit test, the researcher must first estimate 
the values of the population mean ($\mu$) and standard deviation ($\sigma$) by computing the values $\bar{X}$ and $\tilde{s}$ 
from the sample data. Since the latter requires the estimation of two population parameters (i.e., 
$\mu$ and $\sigma$), the appropriate degrees of freedom to employ for the analysis will be df = k - 1 - 2. 
Because the value of k represents the number of cells that are employed in the analysis, there must 
be a minimum of four cells. The latter is true, since if k is less than four, the value of df will be 
less than 1 (which is impossible). Each of the cells in the chi-square table will represent a class 
interval. A class interval is a limited range of values in which scores in a frequency distribution 
are grouped. As is the case with previous applications of the chi-square goodness-of-fit test, 
the expected frequency for each cell/class interval is computed and contrasted with its observed 
frequency. 

The null and alternative hypotheses that are evaluated with the chi-square goodness-of-fit 
test in reference to Example 8.3 can be stated either in the form presented in Section III, or as 
follows. 


Null hypothesis $H_0$: The sample is derived from a normally distributed population. 


Alternative hypothesis $H_1$: The sample is not derived from a normally distributed population. 
This is a nondirectional alternative hypothesis. 


The analysis of Example 8.3 with the chi-square goodness-of-fit test is summarized in 
Tables 8.6 and 8.7. 


Table 8.6 Class Intervals for Chi-Square Analysis of Example 8.3 


Cell/Class interval/Decile      Limits for z values         Limits for X values 
1st decile (0 to .10)           $-1.28 \ge z$               $45.54 \ge X$ 
2nd decile (> .10 to .20)       $-.84 \ge z > -1.28$        $45.54 < X \le 60.85$ 
3rd decile (> .20 to .30)       $-.52 \ge z > -.84$         $60.85 < X \le 71.98$ 
4th decile (> .30 to .40)       $-.25 \ge z > -.52$         $71.98 < X \le 81.37$ 
5th decile (> .40 to .50)       $0 \ge z > -.25$            $81.37 < X \le 90.07$* 
6th decile (> .50 to .60)       $.25 \ge z > 0$             $90.07 < X \le 98.77$* 
7th decile (> .60 to .70)       $.52 \ge z > .25$           $98.77 < X \le 108.16$ 
8th decile (> .70 to .80)       $.84 \ge z > .52$           $108.16 < X \le 119.29$ 
9th decile (> .80 to .90)       $1.28 \ge z > .84$          $119.29 < X \le 134.60$ 
10th decile (> .90 to 1)        $1.28 < z$                  $134.60 < X$ 


*As a general rule, if two or more scores are equal to the value of $\bar{X}$, one-half of the scores are assigned 
to the 5th decile and one-half to the 6th decile. If only one score equals $\bar{X}$, it can be randomly assigned 
to either the 5th or 6th decile. 


Table 8.7 Chi-Square Summary Table for Example 8.3 


Cell/Class interval/Decile    $O_i$    $E_i$    $(O_i - E_i)$    $(O_i - E_i)^2$    $(O_i - E_i)^2 / E_i$ 
1st decile                       4        3           1                1                  .33 
2nd decile                       2        3          -1                1                  .33 
3rd decile                       3        3           0                0                  .00 
4th decile                       2        3          -1                1                  .33 
5th decile                       2        3          -1                1                  .33 
6th decile                       5        3           2                4                 1.33 
7th decile                       4        3           1                1                  .33 
8th decile                       2        3          -1                1                  .33 
9th decile                       3        3           0                0                  .00 
10th decile                      3        3           0                0                  .00 
                              $\sum O_i = 30$   $\sum E_i = 30$   $\sum (O_i - E_i) = 0$   $\chi^2 = 3.31$ 

In Table 8.7 each of the n = 30 scores has been assigned to one of ten cells/categories. The 
ten cells, which are summarized in Table 8.6, correspond to the ten deciles of the normal 
distribution. In the Introduction it is noted that a decile divides a distribution into blocks 
comprised of ten percentage points (or blocks that comprise a proportion equal to .10 of the 
distribution). The z scores that correspond to the limits of the ten deciles in a normal distribution 
were determined through use of Table A1 in the Appendix. Thus, the value z = -1.28 
corresponds to the upper limit of the 10th percentile, since the entry in Column 3 of Table A1 
for z = -1.28 is .1003 (which is the closest value to .1000, which is 10% when expressed as a 
percentage). Given that the value of $\bar{X} = 90.07$ and $\tilde{s} = 34.79$, we can compute that the value 
X = 45.54 corresponds to z = -1.28 by employing the equation $X = \bar{X} + (z)(\tilde{s})$. The latter 
equation is the algebraic transposition of Equation I.27 ($z = (X - \mu)/\sigma$), when $\bar{X}$ is employed 
in place of $\mu$ and $\tilde{s}$ is employed in place of $\sigma$. Thus, if we multiply the value z = -1.28 
by $\tilde{s} = 34.79$ and add $\bar{X} = 90.07$ to the product, we obtain: X = 90.07 + (-1.28)(34.79) 
= 45.54. The latter value indicates that any score less than 45.54 falls in the first decile. 

The value z = -.84 corresponds to the upper limit of the 20th percentile, since the entry in 
Column 3 of Table A1 for z = -.84 is .2005 (which is the closest value to .2000, which is 20% 
when expressed as a percentage). Thus, the second decile will be represented by scores that fall 
above the proportion .10 (or the 10% point) up to the proportion .20 (or the 20% point). When 
the value z = -.84 is substituted in the equation $X = \bar{X} + (z)(\tilde{s})$, we obtain X = 90.07 
+ (-.84)(34.79) = 60.85. The value X = 60.85 is the upper limit of the 2nd decile. Thus, any 
score that is greater than 45.54 but equal to or less than 60.85 falls in the 2nd decile. To 
complete Table 8.6, the procedure that has been described for the 1st and 2nd deciles was 
employed to determine the limits for the eight remaining deciles. 

Employing Equation 8.1, an expected frequency of 3 is computed for each of the cells in 
Table 8.7 (which is the chi-square summary table) by multiplying the sample size n = 30 by .1 
(i.e., $E_i = (30)(.1) = 3$). The value .1 is employed to represent $\pi_i$ in the latter equation, since 
each cell represents an area that corresponds to 10% of a normal distribution. Thus, the likeli- 
hood of an observation falling in any of the cells/deciles is .1. 

Employing Equation 8.2, the value $\chi^2 = 3.31$ is computed for Example 8.3. Since there 
are k = 10 cells and w = 2 parameters that are estimated, the degrees of freedom for the analysis 
are df = 10 - 1 - 2 = 7. Employing Table A4, we determine that for df = 7 the tabled critical .05 
and .01 values are $\chi^2_{.05} = 14.07$ and $\chi^2_{.01} = 18.48$. Since the computed value $\chi^2 = 3.31$ is less 
than both of the aforementioned values, the null hypothesis cannot be rejected. Thus, the 
analysis does not indicate that the data deviate significantly from a normal distribution. This is 
consistent with the conclusion that was reached when the same set of data was evaluated with 
the Kolmogorov-Smirnov goodness-of-fit test for a single sample (which employed the 
population parameters $\mu = 90$ and $\sigma = 35$, which are almost identical to the estimated values 
$\bar{X} = 90.07$ and $\tilde{s} = 34.79$) and the Lilliefors test for normality (which, like the chi-square 
test, employed the values $\bar{X} = 90.07$ and $\tilde{s} = 34.79$). 
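
The steps of the analysis can be retraced in a few lines. The following Python sketch is illustrative only: it takes the decile z limits from Table 8.6 and the observed frequencies from Table 8.7 as given (it does not reassign the raw scores to cells), and all names are assumptions. Because it sums unrounded cell contributions, it yields a chi-square of about 3.33 rather than the 3.31 obtained by summing the rounded entries in Table 8.7; the conclusion is unaffected. 

```python
MEAN, SD = 90.07, 34.79  # estimates taken from the sample data

# z values separating the ten deciles (Table 8.6)
Z_LIMITS = [-1.28, -0.84, -0.52, -0.25, 0.0, 0.25, 0.52, 0.84, 1.28]
x_limits = [MEAN + z * SD for z in Z_LIMITS]   # X = mean + (z)(sd)

# Observed frequencies per decile, copied from Table 8.7
observed = [4, 2, 3, 2, 2, 5, 4, 2, 3, 3]
n, k = sum(observed), len(observed)
expected = [n / k] * k                          # each decile holds 10% of the cases

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

w = 2                                           # parameters estimated: mean and sd
df = k - 1 - w                                  # Equation 8.7

CRITICAL_05 = 14.07                             # tabled critical value for df = 7
reject_null = chi_sq > CRITICAL_05

print([round(x, 2) for x in x_limits])
print(round(chi_sq, 2), df, reject_null)
```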

It should be noted that because of the small sample size employed in the study, the expected 
frequency of 3 for all of the cells is less than the minimum value recommended for the 
chi-square goodness-of-fit test by many sources. The values of the expected frequencies could 
be increased by employing fewer cells in the chi-square table. In other words, one could employ 
quartile blocks (yielding four cells), or blocks consisting of 20% of the cases per block (yielding 
five blocks), etc. Daniel (1990) notes that the outcome of the chi-square goodness-of-fit test 
is affected by the number of cells that are employed in the analysis, and cites studies (e.g., Dahiya 
and Gurland (1973)) that address this issue. Further discussion of the application of the chi- 
square goodness-of-fit test with a continuous variable can be found in Conover (1980, 1999), 
Daniel (1990), and Siegel and Castellan (1988). Conover (1980, 1999) and Daniel (1990) 
describe the test protocol when the format of data is a frequency distribution that reflects the 
number of scores in each of k class intervals. The latter format is most likely to be employed if 
there are a large number of scores, and the researcher elects to group scores in class intervals 
(such as 20 scores falling within the range 1—10, 15 scores falling within the range 11-20, etc.), 
since it provides a succinct way of summarizing the data. In other instances where the original 
data are grouped in class intervals, a researcher may not have access to the exact value of each 
score, but have only a frequency distribution that categorizes each score within one of k class 
intervals. 


7. Sources for computing the power of the chi-square goodness-of-fit test Cohen (1977, 
1988) has developed a statistic called the w index that can be employed to compute the power 
of the chi-square goodness-of-fit test. The value w is an effect size index reflecting the 
difference between expected and observed frequencies. The concept of effect size is discussed 
in Section VI of the single-sample t test (Test 2). It is discussed in greater detail in Section VI 
of the t test for two independent samples (Test 11), and in Section IX (the Addendum) of the 
Pearson product-moment correlation coefficient (Test 28) under the discussion of meta- 
analysis and related topics. 

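The equation for the w index is w = \sqrt{\sum_{i=1}^{k} (P_{1i} - P_{0i})^2 / P_{0i}}, where P_{0i} 
and P_{1i} are the proportions of cases in cell i hypothesized in the null hypothesis and the 
alternative hypothesis, respectively. The latter equation 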
indicates the following: a) For each of the cells in the chi-square table, the proportion of cases 
hypothesized in the null hypothesis is subtracted from the proportion of cases hypothesized in 
the alternative hypothesis; b) The obtained difference in each cell is squared, and then divided 
by the proportion hypothesized in the null hypothesis for that cell; c) All of the values obtained 
for the cells in part b) are summed; and d) w represents the square root of the sum obtained in 
part c). 
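
The four steps just described can be sketched in a few lines of Python. This is only an illustration: the function name cohen_w is hypothetical, and the alternative-hypothesis proportions used below are an assumption made for the sake of the example (they are simply the pooled observed proportions 54/240, 80/240, and 106/240 from Example 8.4, treated as if they had been hypothesized in advance).

```python
from math import sqrt


def cohen_w(p_null, p_alt):
    """Effect size index w: the square root of the sum over cells of
    (P1 - P0)^2 / P0, where P0 and P1 are the proportions stated in the
    null and alternative hypotheses."""
    if len(p_null) != len(p_alt):
        raise ValueError("both hypotheses must specify the same number of cells")
    total = 0.0
    for p0, p1 in zip(p_null, p_alt):
        difference = p1 - p0            # step a)
        total += difference ** 2 / p0   # steps b) and c)
    return sqrt(total)                  # step d)


# Null hypothesis: three equally likely cells.
p_null = [1 / 3, 1 / 3, 1 / 3]
# Assumed alternative (illustration only): pooled proportions of Example 8.4.
p_alt = [54 / 240, 80 / 240, 106 / 240]

w = cohen_w(p_null, p_alt)
print(round(w, 3))
```

Under these assumed proportions the value of w falls between .1 and .3. When the alternative proportions are taken from a sample in this way, w is equal to the square root of the chi-square value divided by n.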

Cohen (1977; 1988, Ch. 7) has derived tables that allow a researcher to determine, through 
use of the w index, the appropriate sample size to employ if one wants to test a hypothesis about 
the difference between observed and expected frequencies in a chi-square table at a specified 
level of power. Cohen (1977; 1988, pp. 224-226) has proposed the following (admittedly 
arbitrary) w values as criteria for identifying the magnitude of an effect size: a) A small effect 
size is one that is greater than .1 but not more than .3; b) A medium effect size is one that is 
greater than .3 but not more than .5; and c) A large effect size is greater than .5. 





8. Heterogeneity chi-square analysis Assume that a researcher conducts m independent 
studies (where m ≥ 2) which evaluate the same goodness-of-fit hypothesis, and that none of the 
studies yields a statistically significant result. However, visual inspection of the data suggests 
a consistent pattern of differences for the observed frequencies of the k categories employed in 
each of the m studies. The researcher suspects that because of the relatively small sample sizes 
employed in the studies, the absence of significant results is largely due to a lack of statistical 
power. In order to increase the power of the analysis, the researcher wants to combine the data 
for the m studies into one table, and evaluate the latter table with the chi-square goodness-of-fit 
test. Zar (1999, pp. 471-473) notes that the procedure for determining whether or not a 
researcher is justified in pooling data under such conditions is referred to as heterogeneity chi- 
square analysis (also referred to as interaction chi-square analysis or homogeneity chi-square 
analysis). 

The null and alternative hypotheses that are evaluated with a heterogeneity chi-square 
analysis are as follows. 


Null hypothesis H_0: The m samples are derived from the same population (i.e., population 
homogeneity). 

Alternative hypothesis H_1: At least two of the m samples are not derived from the same 
population (population heterogeneity). 


Example 8.4 will be employed to illustrate the heterogeneity chi-square analysis. 


Example 8.4 A researcher evaluates a hypothesis that in the lakes of a specific geographical 
region the number of fish representing three different species are equally distributed. Over a 
period of a year four separate studies are conducted, and each study is evaluated with a chi- 
square goodness-of-fit test. Although none of the studies yields a significant result (which if 
present would allow the researcher to conclude that the species of fish are not equally dis- 
tributed), visual inspection of the data suggests that Species 3 is more prevalent than either 
Species 1 or 2, and that of the three species, Species 1 is the least prevalent. Because he suspects 
that the nonsignificant results for the chi-square analyses may be due to a lack of statistical 
power, the researcher would like to combine the data for the four studies, and analyze the pooled 
data. Is the researcher justified in pooling the data? 


Table 8.8 summarizes the analysis of the data for Example 8.4. Part A of Table 8.8 
presents the chi-square goodness-of-fit analysis for each of the four individual studies. Column 
2 for each of the studies contains the observed species frequency for that study. In each study 
the expected frequency for any of the k = 3 cells is one-third of the total number of observations 
(i.e., E_i = (n)(1/3), since the latter implies that the species are equally distributed). 

The following protocol is employed in the heterogeneity chi-square analysis: a) A chi-square 
value is computed for each of the individual studies. (Although it is not the case in our 
example, Zar (1999) notes that if the number of cells per study is k = 2, the correction for 
continuity is not used in analyzing the individual tables.); b) The sum of the m chi-square values 
obtained in a) for the individual studies is computed. The latter value itself represents a 
chi-square value, and will be designated χ²_sum. In addition, the sum of the degrees of freedom for the 
m studies is computed. The latter degrees of freedom value, which will be designated df_sum, is 
obtained by summing the value df = k - 1 m times; c) The data for the m studies are combined into 
one table, and the chi-square value, which will be designated χ²_pooled, is computed for the pooled 
data. The degrees of freedom for the table with the pooled data, which will be designated 
df_pooled, is equal to df = k - 1. (Zar (1999) notes that if there are k = 2 cells, the correction for 
continuity is not used in analyzing the table with the pooled data); d) The heterogeneity 
chi-square analysis is based on the premise that if the m samples are in fact homogeneous, the sum 
of the m individual chi-square values (χ²_sum) should be approximately the same value as the 
chi-square value computed for the pooled data (χ²_pooled). In order to determine the latter, the absolute 
value of the difference between the sum of the m chi-square values (obtained in b)) and the 
pooled chi-square value (obtained in c)) is computed. The obtained difference, which is itself 
a chi-square value, is the heterogeneity chi-square value, which will be designated χ²_het. Thus, 
χ²_het = |χ²_sum - χ²_pooled|. The null hypothesis will be rejected when there is a large difference 
between the values of χ²_sum and χ²_pooled. The value χ²_het, which represents the test statistic, is 
evaluated with a degrees of freedom value that is the sum of the degrees of freedom for the m 
individual studies (df_sum) less the degrees of freedom obtained for the table with the pooled data 
(df_pooled). Thus, df_het = df_sum - df_pooled. In order to reject the null hypothesis, the value χ²_het 
must be equal to or greater than the tabled critical value at the prespecified level of significance 
for df_het; and e) If the null hypothesis is rejected the data cannot be pooled. If, however, the null 
hypothesis is retained, the data can be pooled, and the computed value for χ²_pooled is employed 
to evaluate the goodness-of-fit hypothesis. Zar (1999) notes, however, that if there are k = 2 cells 
(i.e., df = 1), the table for the pooled data should be reevaluated employing the correction for 
continuity, and the continuity-corrected χ²_pooled value (which will be a little lower than the 
original χ²_pooled value) should be employed to evaluate the goodness-of-fit hypothesis. 


© 2000 by Chapman & Hall/CRC 


Table 8.8 Heterogeneity Chi-Square Analysis for Example 8.4 


A. Chi-square analysis of four individual studies 

Study 1 
Cell/Species   O_i   E_i   (O_i - E_i)   (O_i - E_i)²   (O_i - E_i)²/E_i 
1               10    15       -5             25             1.67 
2               15    15        0              0             0 
3               20    15        5             25             1.67 
Sums            45    45        0                            χ²_1 = 3.33 

Study 2 
Cell/Species   O_i   E_i   (O_i - E_i)   (O_i - E_i)²   (O_i - E_i)²/E_i 
1               13    20       -7             49             2.45 
2               21    20        1              1              .05 
3               26    20        6             36             1.80 
Sums            60    60        0                            χ²_2 = 4.30 

Study 3 
Cell/Species   O_i   E_i   (O_i - E_i)   (O_i - E_i)²   (O_i - E_i)²/E_i 
1               19    25       -6             36             1.44 
2               22    25       -3              9              .36 
3               34    25        9             81             3.24 
Sums            75    75        0                            χ²_3 = 5.04 

Study 4 
Cell/Species   O_i   E_i   (O_i - E_i)   (O_i - E_i)²   (O_i - E_i)²/E_i 
1               12    20       -8             64             3.20 
2               22    20        2              4              .20 
3               26    20        6             36             1.80 
Sums            60    60        0                            χ²_4 = 5.20 

Sum of chi-square values for four studies: χ²_sum = 3.33 + 4.30 + 5.04 + 5.20 = 17.87 


B. Chi-square analysis of pooled data 

Pooled data for m = 4 studies 
Cell/Species   O_i   E_i   (O_i - E_i)   (O_i - E_i)²   (O_i - E_i)²/E_i 
1               54    80      -26            676             8.45 
2               80    80        0              0             0 
3              106    80       26            676             8.45 
Sums           240   240        0                            χ²_pooled = 16.90 


C. Heterogeneity chi-square analysis 

Heterogeneity chi-square = Sum of chi-square values for four studies - Pooled chi-square value 
χ²_het = |(χ²_sum = 17.87) - (χ²_pooled = 16.90)| = .97 

The computed chi-square values for the four studies in Table 8.8 are χ²_1 = 3.33, 
χ²_2 = 4.30, χ²_3 = 5.04, and χ²_4 = 5.20. Since, in each of the four chi-square tables, there are 
k = 3 cells, df = k - 1 = 2 for each table. Thus, the total number of degrees of freedom employed 
for the four studies is df_sum = (4)(2) = 8 (i.e., the number of studies (4) multiplied by the number 
of degrees of freedom per study (2)). Since there are k = 3 cells in the table for the pooled data, 
the degrees of freedom for the latter table is df_pooled = k - 1 = 2. By summing the chi-square 
values for the four studies, we compute the value χ²_sum = 3.33 + 4.30 + 5.04 + 5.20 = 17.87. 
Since, in Part B of Table 8.8, we compute χ²_pooled = 16.90, the value for heterogeneity 
chi-square (computed in Part C of Table 8.8) is χ²_het = |(χ²_sum = 17.87) - (χ²_pooled = 16.90)| = .97. 
The degrees of freedom employed to evaluate the latter chi-square value are df_het = (df_sum = 8) 
- (df_pooled = 2) = 6. The tabled critical .05 and .01 chi-square values in Table A4 for df = 6 are 
χ²_.05 = 12.59 and χ²_.01 = 16.81. Since the computed value χ²_het = .97 is less than χ²_.05 = 12.59, 
the null hypothesis is retained. In other words we can conclude the four samples are 
homogeneous (i.e., come from the same population), and thus we can justify pooling the data into a 
single table. 
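
The computations for Example 8.4 can be checked with the short Python sketch below. It is offered as an illustration rather than as a definitive implementation: the function names are hypothetical, and the helper chi2_sf_even_df (which returns the right-tail area of the chi-square distribution) relies on a closed-form expression that holds only when the degrees of freedom value is an even number, as is the case here (df = 6 and df = 2).

```python
from math import exp, factorial


def chi_square_gof(observed, probs):
    """Chi-square goodness-of-fit statistic for one table."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))


def chi2_sf_even_df(x, df):
    """Right-tail area P(chi-square >= x); valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2
    return exp(-half) * sum(half ** j / factorial(j) for j in range(df // 2))


# Observed species frequencies for the four studies (Table 8.8, Part A).
studies = [[10, 15, 20], [13, 21, 26], [19, 22, 34], [12, 22, 26]]
probs = [1 / 3, 1 / 3, 1 / 3]          # species equally distributed
k, m = len(probs), len(studies)

chi_sum = sum(chi_square_gof(s, probs) for s in studies)   # step b)
pooled = [sum(cell) for cell in zip(*studies)]             # step c)
chi_pooled = chi_square_gof(pooled, probs)
chi_het = abs(chi_sum - chi_pooled)                        # step d)

df_sum = m * (k - 1)
df_pooled = k - 1
df_het = df_sum - df_pooled

print(pooled, round(chi_sum, 2), round(chi_pooled, 2), round(chi_het, 2), df_het)
print(chi2_sf_even_df(chi_het, df_het))       # tail area for heterogeneity
print(chi2_sf_even_df(chi_pooled, df_pooled))  # tail area for pooled data
```

The first tail area is far above .05, which agrees with the decision to retain the null hypothesis of homogeneity. The second is below .01, which agrees with the rejection of the goodness-of-fit null hypothesis for the pooled data.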

As noted earlier, in Part B of Table 8.8 the value χ²_pooled = 16.90 is computed for the pooled 
data. Since there are k = 3 cells in the chi-square table for the pooled data, df = k - 1 = 2. The 
tabled critical .05 and .01 values in Table A4 for df = 2 are χ²_.05 = 5.99 and χ²_.01 = 9.21. Since 
the value χ²_pooled = 16.90 is larger than both of the aforementioned critical values, the goodness- 
of-fit null hypothesis for the pooled data can be rejected at both the .05 and .01 levels. In other 
words, with respect to the pooled data, we can conclude that in the case of at least one of the 
cells/species there is a difference between its observed and expected frequency. Without con- 
ducting additional comparisons, it appears that, as the researcher suspected, the observed 
frequency for Cell/Species 1 is significantly below its expected frequency, while the observed 
frequency for Cell/Species 3 is significantly above its expected frequency. Although it does 
not apply to our example, as noted earlier, Zar (1999) (who provides a comprehensive discussion 
of the heterogeneity chi-square analysis) states that if the number of cells in the table for the 
pooled data is k = 2, the latter table should be reevaluated employing the correction for 
continuity. 

It should be emphasized that a researcher should employ common sense in applying the 
heterogeneity chi-square analysis described in this section. To be more specific, there may be 
occasions when, even though the computed value of χ²_het is not significant, 
it would not be recommended that the researcher pool the data from two or more smaller tables. 
In particular, one should not pool data from two or more tables employing small sample 
sizes (which when evaluated individually fail to yield a significant chi-square value) in order to 
obtain a significant pooled chi-square value, when there is an obvious inconsistency in the cell 
proportions for two or more of the tables. In other words, when the data from m tables are 
pooled, the proportion of cases in the cells of each of the m tables should be approximately the 
same. Everitt (1977, 1992) and Fleiss (1981) recommend alternative procedures for pooling the 
data from multiple chi-square tables. 


VII. Additional Discussion of the Chi-Square Goodness-of-Fit Test 


1. Directionality of the chi-square goodness-of-fit test In Section III it is noted that most 
sources state the alternative hypothesis for the chi-square goodness-of-fit test nondirectionally, 
but that in actuality it is possible to state the alternative hypothesis directionally. This is most 
obvious when there are k = 2 cells and the expected probability associated with each cell is 
π_i = 1/2. Under the latter conditions a researcher can make two directional predictions, either 
one of which can represent the alternative hypothesis. Specifically, the following can be 
predicted with respect to the sample data: a) The observed frequency of Cell 1 will be 
significantly higher than the observed frequency of Cell 2 (which translates into the observed 
frequency of Cell 1 being higher than its expected frequency, and the observed frequency of Cell 
2 being lower than its expected frequency); and b) The observed frequency of Cell 2 will be 
significantly higher than the observed frequency of Cell 1 (which translates into the observed 
frequency of Cell 2 being higher than its expected frequency, and the observed frequency of Cell 
1 being lower than its expected frequency). 

If a researcher wants to evaluate either of the aforementioned directional alternative 
hypotheses at the .05 level, the appropriate critical value to employ is the tabled chi-square value 
(for df = 1) at the .10 level of significance. The latter value is represented by the tabled chi- 
square value at the 90th percentile (which demarcates the extreme 10% in the right tail of the chi- 
square distribution). This latter critical value will be designated as χ²_.10 throughout this 
discussion. The rationale for employing χ²_.10 in evaluating the directional alternative hypothesis at the 
.05 level is as follows. When k = 2 and the alternative hypothesis is stated nondirectionally, if 
a computed chi-square value is equal to or greater than χ²_.05 (for df = 1), the researcher can reject 
the null hypothesis if the data are consistent with either of the outcomes associated with the two 
possible directional alternative hypotheses. If that same tabled critical chi-square value is em- 
ployed to evaluate one of the two possible directional alternative hypotheses, a directional 
alternative hypothesis would be evaluated not at the .05 level but at one-half that value — in 
other words at the .025 level. Thus, if one wants to employ α = .05 and states the alternative 
hypothesis directionally, the alpha level for the directional alternative hypothesis should be .05 
multiplied by the number of possible directional alternative hypotheses (which in this instance 
equals 2). By employing the tabled critical value for χ²_.10, which is a lower value than χ²_.05, an 
alpha level (and Type I error rate) of .05 is established for the specific one-tailed alternative 
hypothesis that one employs. The area of the chi-square distribution that corresponds to the 
alpha level for the latter directional alternative hypothesis will be one-half of the area that 
comprises the extreme 10% of the right tail of the distribution. The area that corresponds to the 
alpha level for the other directional alternative hypothesis will be the remaining 5% of the 
extreme 10% in the right tail of the distribution. 
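
The relationship between the two critical values can be verified numerically. The sketch below is an illustration under stated assumptions: the function name chi2_sf_df1 is hypothetical, the critical value 2.71 is the standard tabled chi-square value at the 90th percentile for df = 1 (it is not quoted in the passage above), and the tail area is obtained from the identity that, for df = 1 only, the right-tail area beyond x equals erfc(sqrt(x/2)).

```python
from math import erfc, sqrt


def chi2_sf_df1(x):
    """Right-tail area P(chi-square >= x) for df = 1 only."""
    return erfc(sqrt(x / 2))


n_directional = 2        # two possible directional predictions when k = 2

crit_05 = 3.84           # tabled value at the 95th percentile, df = 1
crit_10 = 2.71           # tabled value at the 90th percentile, df = 1 (assumed)

tail_05 = chi2_sf_df1(crit_05)   # about .05
tail_10 = chi2_sf_df1(crit_10)   # about .10

# Alpha for one specific directional alternative hypothesis is the
# right-tail area divided by the number of directional hypotheses.
alpha_dir_using_05 = tail_05 / n_directional   # about .025
alpha_dir_using_10 = tail_10 / n_directional   # about .05

print(round(alpha_dir_using_05, 3), round(alpha_dir_using_10, 3))
```

In other words, the critical value that bounds the extreme 10% of the right tail yields an alpha level of approximately .05 for either one of the two directional alternative hypotheses.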

If we turn our attention to Examples 8.1 and 8.2, in both examples there are, in fact, 720 
possible directional predictions the researcher can make! The latter value is determined as 
follows: k! = 6! = (6)(5)(4)(3)(2)(1) = 720.¹¹ In other words, a researcher can predict any one 
of 720 ordinal configurations with respect to the observed frequencies of the six cells. As an 
example, the researcher might predict that 1/Monday will have the highest observed frequency, 
followed in order by 2/Tuesday, 3/Wednesday, 4/Thursday, 5/Friday, and 6/Saturday, and 
only be willing to reject the null hypothesis if the data are consistent with this specific ordering 
of the observed cell frequencies. 

Later in this section it will be explained that when k = 6, it is not possible to evaluate any 
one of the 720 directional alternative hypotheses at either the .05 or .01 level of significance. 
Indeed, under these conditions the highest value of alpha that can be employed to evaluate a 
directional alternative hypothesis is approximately .001. In point of fact, when k = 6, the tabled 
critical χ²_.28 value is approximately 2.7, which happens to be the tabled chi-square value at the 
28th percentile. This is the case, since if the prespecified alpha value that is employed to evaluate 
a nondirectional alternative hypothesis (which in this case we will assume is .001, since it is the 
highest value that will work with k = 6) is multiplied by the number of possible directional 
alternative hypotheses (720) we obtain: (.001)(720) = .72. The value .72 demarcates the extreme 
72% in the right tail of the chi-square distribution when df = 5. Thus, it corresponds to the 28th 
percentile of the distribution, since (1 - .72) = .28. Consequently, in order to reject the null 
hypothesis with reference to one of the 720 possible directional alternative hypotheses, both of the 
following conditions will have to be met: a) The obtained value of chi-square will have to be 
equal to or greater than the tabled chi-square value at the 28th percentile; and b) The data will 
have to be consistent with the directional alternative hypothesis that is employed. In other words, 
the ordinal relationship between the observed frequencies of the six cells should be in the exact 
order stated in the directional alternative hypothesis. 

As noted previously, none of the 720 possible directional alternative hypotheses can be 
evaluated at either the .05 or .01 levels. This is the case since, if either .05 or .01 is multiplied 
by 720, the resulting product exceeds unity (1) — i.e., (.05)(720) = 36 and (.01)(720) = 7.2. 
Since both 36 and 7.2 exceed 1, they cannot be used as probability values. Thus, it is impossible 
to evaluate any of the directional alternative hypotheses at either the .05 or .01 level. In fact, the 
largest alpha level at which any of the directional alternative hypotheses can be evaluated is 
.001388, since 1/720 = .001388 (i.e., the value .001388 is the maximum number which when 
multiplied by 720 falls short of 1. Specifically, (.001388)(720) = .99936). 

On initial inspection it might appear that by employing χ²_.28 to evaluate one of the 
directional alternative hypotheses, a researcher is employing an inflated alpha level. But as just 
noted, in this instance the alpha level for a directional alternative hypothesis is, in fact, .001. Of 
course a researcher may elect to employ a larger critical chi-square value, and if one elects to do 
so, the actual alpha level for a directional alternative hypothesis will be even lower than .001. 
For example, if the tabled critical value χ²_.05 = 11.07 (which is the tabled value for df = 5 at the 
95th percentile that is employed in evaluating a nondirectional alternative hypothesis) is 
employed to evaluate one of the 720 possible directional alternative hypotheses, the actual alpha 
level that one will be using in evaluating the directional alternative hypothesis will be .05/720 
= .00007. In such a case there is obviously a minuscule likelihood of committing a Type I error 
in reference to the directional alternative hypothesis. Yet at the same time, the power of the 
analysis with respect to that alternative hypothesis will be minimal (thus resulting in a high 
likelihood of committing a Type II error). 

It is also possible in Examples 8.1 and 8.2 to state an alternative hypothesis that predicts 
two or more, but less than 720, of the possible ordinal configurations with respect to the observed 
frequencies of the k = 6 cells. In other words, a directional alternative hypothesis might state that 
the null hypothesis can only be rejected if the magnitude of the observed cell frequencies in 
descending order is either Cell 1, Cell 2, Cell 3, Cell 4, Cell 5, Cell 6 or Cell 2, Cell 1, Cell 3, 
Cell 4, Cell 5, Cell 6. In such a case, to evaluate the null hypothesis at the .001 level with 
respect to an alternative hypothesis involving two of the 720 possible ordinal configurations, the 
tabled critical chi-square value at the 64th percentile is employed (i.e., χ²_.64, since the area of the 
distribution involved is the extreme 36% ((1 - .64) = .36) that falls in the right tail). The general 
procedure for computing the percentile rank in the chi-square distribution in determining the 
critical value when evaluating one or more configurations is as follows: a) Divide the total 
number of possible configurations by the number of acceptable configurations stated in the 
directional alternative hypothesis; b) Multiply the result of the division by the prespecified alpha 
level; and c) Subtract the value obtained in part b) from 1. 

Applying this protocol to a directional alternative hypothesis in which only 2 out of 
720 configurations are acceptable, we derive: a) 720/2 = 360; b) (360)(.001) = .36; and 
c) (1 - .36) = .64. The resulting value of .64 (which can be converted to 64%) corresponds to 
the percentile rank in the chi-square distribution to employ in determining the critical value. The 
value .36 obtained in part b) represents the overall proportion of the right tail of the distribution 
which contains a proportion of the distribution equivalent to .001 that represents the rejection 
zone for the directional alternative hypothesis under study. 

It should be noted that since 2/720 = .0028, an alternative hypothesis involving 2 out of 720 
possible configurations can be evaluated at a level above .001. In point of fact, such an alternative 
hypothesis can be evaluated at any level equal to or less than .0028. Thus if one elects to 
employ α = .002, using the protocol described in the previous paragraph, (720/2)(.002) = .72, 
and (1 - .72) = .28. The latter result indicates that χ²_.28 = 2.7 is once again employed as the 
critical value. This is the case, since the latter value represents the tabled value at the 28th percentile 
for df = 5. 
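
The three-step protocol for locating the percentile rank of the critical value, together with the upper limit on alpha, can be expressed as a minimal Python sketch. The function names critical_percentile and max_alpha are hypothetical and are introduced only for this illustration.

```python
from math import factorial


def max_alpha(k, acceptable=1):
    """Largest alpha at which `acceptable` of the k! ordinal
    configurations can be evaluated."""
    return acceptable / factorial(k)


def critical_percentile(k, acceptable, alpha):
    """Percentile rank (as a proportion) of the chi-square distribution
    at which the critical value is read."""
    total = factorial(k)                  # number of possible configurations
    ratio = total / acceptable            # step a)
    tail = ratio * alpha                  # step b): right-tail proportion
    if tail > 1:
        raise ValueError("alpha is too large for this many configurations")
    return 1 - tail                       # step c)


print(round(critical_percentile(6, 1, 0.001), 2))   # one configuration
print(round(critical_percentile(6, 2, 0.001), 2))   # two configurations
print(round(critical_percentile(6, 2, 0.002), 2))   # two configurations, alpha = .002
print(round(max_alpha(6, 1), 6), round(max_alpha(6, 2), 4))
```

A call such as critical_percentile(6, 1, 0.05) raises an error, which mirrors the earlier observation that (.05)(720) = 36 exceeds 1 and therefore cannot be used as a probability.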

In closing this discussion, the reader should take note of the fact that all of the critical 
values for the chi-square goodness-of-fit test are derived from the right tail of the chi-square 
distribution. In point of fact, with the exception of the single-sample chi-square test for a 
population variance (in which case critical values are derived from both tails of the 
distribution), all of the tests in the book that employ the chi-square distribution only use critical 
values from the right tail of the distribution. 


2. Additional goodness-of-fit tests In addition to the chi-square goodness-of-fit test, the 
Kolmogorov-Smirnov goodness-of-fit test for a single sample, and Lilliefors test for nor- 
mality (both of which have been alluded to in this chapter), there are a number of other tests 
that have been developed for evaluating goodness-of-fit. Three other tests discussed in the book 
that can be employed to assess goodness-of-fit for a normal distribution are the single sample 
test for evaluating population skewness (Test 4), the single sample test for evaluating 
population kurtosis (Test 5), and the D'Agostino-Pearson test of normality (Test 5a). Most 
of the alternative goodness-of-fit tests (including the Kolmogorov-Smirnov and Lilliefors tests) 
evaluate scores that are assigned to ordered categories. Among the other goodness-of-fit 
tests that have been developed for evaluating ordered categorical data are David’s empty cell 
test (David (1950)), the Cramér-von Mises goodness-of-fit test (attributed to Cramér (1928), 
von Mises (1931), and Smirnov (1936)), and the Shapiro-Wilk test for normality (Shapiro and 
Wilk (1965, 1968)) (which is described in Conover (1980, 1999)). D'Agostino and Stephens 
(1996) and Daniel (1990) contain comprehensive discussions of alternative goodness-of-fit pro- 
cedures. 

The general subject of goodness-of-fit tests for randomness is discussed in Section VII of 
the single-sample runs tests (Test 10). Autocorrelation, a procedure that can also be employed 
for assessing goodness-of-fit for randomness, is discussed in Section VII of the Pearson 
product-moment correlation coefficient. 


VIII. Additional Examples Illustrating the Use of the Chi-Square 
Goodness-of-Fit Test 


Three additional examples that can be evaluated with the chi-square goodness-of-fit test are 
presented in this section. Example 8.5 employs the same data set as Examples 8.1 and 8.2, and 
thus yields the same results. Examples 8.6 and 8.7 illustrate the application of the chi-square 
goodness-of-fit test to data in which the expected frequencies are based on existing empirical 
information or theoretical conjecture rather than on expected/theoretical probabilities. 


Example 8.5 The owner of the Big Wheel Speedway, a stock car racetrack, asks a researcher 
to determine whether or not there is any bias associated with the lane to which a car is assigned 
at the beginning of a race. Specifically, the owner wishes to determine if there is an equal 
likelihood of winning a race associated with each of the six lanes of the track. The researcher 
examines the results of 120 races and determines the following number of first place finishes for 
the six lanes: Lane 1 (20); Lane 2 (14); Lane 3 (18); Lane 4 (17); Lane 5 (22); Lane 6 (29). 


Example 8.6 A country in which four ethnic groups make up the population establishes affirmative 
action guidelines for medical school admissions. The country has one medical school, and 
it is mandated that each new class of medical students proportionally represents the four ethnic 
groups that comprise the country's population. The four ethnic groups that make up the 
population and the proportion of people in each ethnic group are: Balzacs (.4), Crosacs (.25), 
Murads (.3), and Isads (.05). The number of students from each ethnic group admitted into the 
medical school class for the new year are: Balzacs (300), Crosacs (220), Murads (400), and 
Isads (80). Is there a significant discrepancy between the proportions mandated in the affirma- 
tive action guidelines and the actual proportion of the four ethnic groups in the new medical 
school class? 


Except for the fact that empirical data are used as a basis for determining the expected 
cell frequencies, this example is evaluated in the same manner as Examples 8.1 and 8.2. There 
are k = 4 cells — each cell representing one of the four mutually exclusive ethnic groups. The 
observed frequencies are the number of students from each of the four ethnic groups out of the 
total of 1000 students who are admitted to the medical school (i.e., n = 300 + 220 + 400 + 80 
= 1000). The expected frequencies are computed based upon the proportion of each ethnic group 
in the population. Each of these values is obtained by multiplying the total number of medical 
school admissions (1000) by the proportion of a specific ethnic group in the population. Thus 
in the case of the Balzacs, employing Equation 8.1 the expected frequency is computed as 
follows: E_1 = (1000)(.4) = 400. In the same respect, the expected frequencies for the other 
three ethnic groups are: Crosacs: E_2 = (1000)(.25) = 250; Murads: E_3 = (1000)(.3) = 300; 
and Isads: E_4 = (1000)(.05) = 50. Table 8.9 summarizes the observed and expected 
frequencies and the resulting values employed in the computation of the chi-square value through 
use of Equation 8.2. 


Table 8.9 Chi-Square Summary Table for Example 8.6 


Cell       O_i    E_i    (O_i - E_i)   (O_i - E_i)²   (O_i - E_i)²/E_i 
Balzacs    300    400       -100          10000            25 
Crosacs    220    250        -30            900             3.6 
Murads     400    300        100          10000            33.3 
Isads       80     50         30            900            18 
Sums      1000   1000          0                           χ² = 79.9 


Employing Table A4, we determine that for df = 4 - 1 = 3 the tabled critical .05 and .01 
values are χ²_.05 = 7.81 and χ²_.01 = 11.34. (A nondirectional alternative hypothesis is assumed.) 
Since the computed value χ² = 79.9 is greater than both of the aforementioned critical values, 
the null hypothesis can be rejected at both the .05 and .01 levels. Based on the chi-square analysis 
it can be concluded that the medical school admissions data do not adhere to the proportions 
mandated in the affirmative action guidelines. Inspection of Table 8.9 suggests that the sig- 
nificant difference is primarily due to the presence of too many Murads and Isads and too few 
Balzacs. This observation can be confirmed by employing the appropriate comparison procedure 
described in Section VI. 
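
The analysis of Example 8.6 can be reproduced with the following Python sketch, which prints the contribution of each cell to the chi-square value so that the cells responsible for the discrepancy can be seen. The variable names are hypothetical; the critical values are those quoted above from Table A4.

```python
groups = ["Balzacs", "Crosacs", "Murads", "Isads"]
observed = [300, 220, 400, 80]
proportions = [0.40, 0.25, 0.30, 0.05]   # mandated population proportions

n = sum(observed)
expected = [n * p for p in proportions]  # expected frequency = (n)(proportion)

# Contribution of each cell: (O - E)^2 / E
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)
df = len(observed) - 1

for name, o, e, c in zip(groups, observed, expected, contributions):
    print(f"{name:8s} O = {o:4d}  E = {e:6.1f}  O - E = {o - e:7.1f}  contribution = {c:5.1f}")

critical_05, critical_01 = 7.81, 11.34   # tabled values for df = 3
print(round(chi_square, 1), df, chi_square >= critical_05, chi_square >= critical_01)
```

The sign of each difference O - E shows the direction of the discrepancy for that cell (an excess of Murads and Isads, and a deficit of Balzacs and Crosacs), while the size of each contribution shows how much that cell adds to the overall chi-square value.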


Example 8.7 A physician who specializes in genetic diseases develops a theory which predicts 
that two-thirds of the people who develop a disease called cyclomeiosis will be males. She 
randomly selects 300 people who are afflicted with cyclomeiosis and observes that 140 of them 
are females. Is the physician's theory supported? 


In Example 8.7 there are two cells, each representing one gender. Since the expected 
frequencies are computed on the basis of the probabilities hypothesized in the physician's theory, 
two-thirds of the sample are expected to be males and the remaining one-third females. Thus, 
the respective expected frequencies for males and females are determined as follows: Males: 
E_1 = (300)(2/3) = 200; Females: E_2 = (300)(1/3) = 100. Since 140 females are observed 
with the disease, the remaining 160 people who have the disease must be males. Table 8.10 
summarizes the observed and expected frequencies and the resulting values that are employed 
in the computation of the chi-square value with Equation 8.2. 


Table 8.10 Chi-Square Summary Table for Example 8.7 


Cell       O_i    E_i    (O_i - E_i)   (O_i - E_i)²   (O_i - E_i)²/E_i 
Males      160    200        -40           1600             8 
Females    140    100         40           1600            16 
Sums       300    300          0                           χ² = 24 


Employing Table A4, we determine that for df = 2 - 1 = 1 the tabled critical .05 and .01 
values are χ²_.05 = 3.84 and χ²_.01 = 6.63. Since the computed value χ² = 24 is greater than 
both of the aforementioned critical values, the null hypothesis can be rejected at both the .05 and 
.01 levels. Based on the chi-square analysis, it can be concluded that the observed distribution 
of males and females for the disease is not consistent with the doctor's theory. 


Endnotes 


1. Categories are mutually exclusive if assignment to one of the k categories precludes a 
subject/object from being assigned to any one of the remaining (k - 1) categories. 


2. The reason why the exact probabilities associated with the binomial and multinomial 
distributions are generally not computed is that, except when the value of n is very small, 
an excessive amount of computation is involved. The binomial distribution is discussed 
under the binomial sign test for a single sample, and the multinomial distribution is 
discussed in Section IX (the Addendum) of the latter test. 


3. Example 8.5 in Section VIII illustrates an example in which the expected frequencies are 
based on prior empirical information. 


4. It is possible for the value of the numerator of a probability ratio to be some value other than 
1. For instance, if one is evaluating the number of odd versus even numbers that appear on n 
rolls of a die, in each trial there are k = 2 categories. Three face values (1, 3, 5) will result 
in an observation being categorized as an odd number, and three face values (2, 4, 6) will 
result in an observation being categorized as an even number. Thus, the probability associ- 
ated with each of the two categories will be 3/6 = 1/2. It is also possible for each of the 
categories to have different probabilities. Thus, if one is evaluating the relative occurrence 
of the face values 1 and 2 versus the face values 3, 4, 5, and 6, the probability associated 
with the former category will be 2/6 = 1/3 (since two outcomes fall within the category 1/2), 
while the probability associated with the latter will be 4/6 = 2/3 (since four outcomes fall 
within the category 3/4/5/6). Examples 8.6 and 8.7 in Section VIII illustrate examples 
where the probabilities for two or more categories are not equal to one another. 


5. When decimal values are involved, there may be a minimal difference between the sums 
of the expected and observed frequencies due to rounding off error. 


6. There are some instances when Equation 8.3 should be modified to compute the degrees 
of freedom for the chi-square goodness-of-fit test. The modified degrees of freedom 
equation is discussed in Section VI, within the framework of employing the chi-square 
goodness-of-fit test to assess goodness-of-fit for a normal distribution. 


7. Sometimes when one or more cells in a set of data have an expected frequency of less than 
five, by combining cells (as is done in this analysis) a researcher can reconfigure the data 
so that the expected frequency of all the resulting cells is greater than five. Although this 
is one way of dealing with the violation of the assumption concerning the minimum 
acceptable value for an expected cell frequency, the null hypothesis evaluated with the 
reconfigured data will not be identical to the null hypothesis stipulated in Section III. 


8. In a one-dimensional chi-square table, subjects/objects are assigned to categories which 
reflect their status on a single variable. In a two-dimensional table, two variables are 
involved in the categorization of subjects/objects. As an example, if each of n subjects is 
assigned to a category based on one's gender and whether one is married or single, a two- 
dimensional table can be constructed involving the following four cells: Male-Married; 
Female-Married; Male-Not married; Female-Not married. Note that people assigned 
to a given cell fall into one of the two categories on each of the two dimensions/variables 
(which are gender and marital status). Analysis of two-dimensional tables is discussed 
under the chi-square test for r x c tables. In Section VII of the latter test, tables with more 
than two dimensions (commonly referred to as multidimensional contingency tables) are 
also discussed. 


9. Daniel (1990) notes that the procedure described in this section will only yield reliable 
results when nπ₁ and n(1 − π₁) are both greater than 5. It is assumed that the researcher 
estimates the value of π₁ prior to collecting the data. The researcher bases the latter value 
either on probability theory or preexisting empirical information. Generally speaking, if 
the value of n is large, the value of p₁ should provide a reasonable approximation of the 
value of π₁ for calculating the values nπ₁ and n(1 − π₁). 


10. Since, when k = 2, it is possible to state two directional alternative hypotheses, some 
sources refer to an analysis of such a nondirectional alternative hypothesis as a two-tailed 
test. Using the same logic, when k > 2 one can conceptualize a test of a nondirectional 
alternative hypothesis as a multi-tailed test (since, when k > 2, it is possible to have more 
than two directional alternative hypotheses). It should be pointed out that since the 
chi-square goodness-of-fit test only utilizes the right tail of the distribution, it is questionable 
to use the terms two-tailed or multi-tailed in reference to the analysis of a nondirectional 
alternative hypothesis. 


11. k! is referred to as k factorial. The notation indicates that the integer number preceding the 
! is multiplied by all integer values below it. Thus, k! = (k)(k − 1) ... (1). By definition 0! 
is set equal to 1. A method of computing an approximate value for n! was developed by 
James Stirling (1730). (The letter n is more commonly employed as the notation to 
represent the number for which a factorial value is computed.) Stirling's approximation 
(described in Feller (1968), Miller and Miller (1999) and Zar (1999)) is n! ≈ √(2πn) (n/e)ⁿ, 
which can also be written as n! ≈ √(2π) (n^(n+.5))(e^(−n)). As noted in Endnote 5 in the 
Introduction, the value e in the Stirling equation is the base of the natural system of 
logarithms. e, which equals 2.71828..., is an irrational number (i.e., a number that has 
a decimal notation that goes on forever without a repeating pattern of digits). 
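
The quality of Stirling's approximation can be illustrated numerically. The sketch below is only an illustration: the function name stirling is an assumption introduced here, and the values of n are arbitrary choices. It compares the approximation √(2πn)(n/e)ⁿ with the exact factorial.

```python
import math


def stirling(n):
    """Stirling's approximation of n!: sqrt(2 * pi * n) * (n / e) ** n."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n


for n in (1, 5, 10, 20):
    exact = math.factorial(n)
    approx = stirling(n)
    # The ratio approx / exact moves toward 1 as n increases.
    print(n, exact, round(approx, 2), round(approx / exact, 4))
```

For these values the approximation falls slightly below the exact factorial, and the relative error shrinks as n grows.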


12. The subscript .72 in the notation χ²_.72 represents the .72 level of significance. The value .72 
is based on the fact that the extreme 72% of the right tail of the chi-square distribution is 
employed in evaluating the directional alternative hypothesis. The value χ²_.72 falls at the 
28th percentile of the distribution. 


13. The two most commonly employed (as well as discussed) goodness-of-fit tests are the 
chi-square goodness-of-fit test and the Kolmogorov-Smirnov goodness-of-fit test for 
a single sample. A general discussion of differences between the latter two tests can be 
found in Section VII of the Kolmogorov-Smirnov goodness-of-fit test for a single 
sample. 


14. When categories are ordered, there is a direct (or inverse) relationship between the mag- 
nitude of the score of a subject on the variable being measured and the ordinal position of 
the category to which that score has been assigned. An example of ordered categories 
which can be employed with the chi-square goodness-of-fit test are the following four 
categories that can be used to indicate the magnitude of a person's IQ: Cell 1 (1st 
quartile); Cell 2 (2nd quartile); Cell 3 (3rd quartile); and Cell 4 (4th quartile). The 
aforementioned categories can be employed if one wants to determine whether, within a 
sample, an equal number of subjects are observed in each of the four quartiles. Note that 
in Examples 8.1 and 8.2, the fact that an observation is assigned to Cell 6 is not indicative 
of a higher level of performance or superior quality than an observation assigned to Cell 
1 (or vice versa). However, in the IQ example, there is a direct relationship between the 
number used to identify each cell and the magnitude of IQ scores for subjects who have 
been assigned to that cell. 


15. Note that the sum of the proportions must equal 1. 


16. Even though a nondirectional alternative hypothesis will be assumed, this example 
illustrates a case in which some researchers might view it more prudent to employ a 
directional alternative hypothesis. 


Test 9 
The Binomial Sign Test for a Single Sample 


(Nonparametric Test Employed with Categorical/Nominal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test In an underlying population comprised of two categories that 
is represented by a sample, is the proportion of observations in one of the two categories equal 
to a specific value? 


Relevant background information on test The binomial sign test for a single sample is based 
on the binomial distribution, which is one of a number of discrete probability distributions 
discussed in this chapter. A discrete probability distribution is a distribution in which the 
values a variable may assume are finite (as opposed to a continuous probability distribution 
in which a variable may assume an infinite number of values). The basic assumption underlying 
the binomial distribution is that each of n independent observations (i.e., the outcome for any 
given observation is not influenced by the outcome for any other observation) is randomly 
selected from a population, and that each observation can be classified in one of k = 2 mutually 
exclusive categories. Within a binomially distributed population, the likelihood that an 
observation will fall in Category 1 will equal π₁, and the likelihood that an observation will fall in 
Category 2 will equal π₂. Since it is required that π₁ + π₂ = 1, it follows that π₂ = 1 − π₁. 
The mean (μ, which is also referred to as the expected value) and standard deviation (σ) of a 
binomially distributed variable are computed with Equations 9.1 and 9.2. 


$$\mu = n\pi_1$$ (Equation 9.1) 

$$\sigma = \sqrt{n\pi_1\pi_2}$$ (Equation 9.2) 
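
As a quick numerical illustration of Equations 9.1 and 9.2, the sketch below computes the mean and standard deviation for two sample sizes. The function name binomial_mean_sd is a hypothetical helper introduced here, and the values n = 10 and n = 200 with π₁ = .5 are chosen only because they match the examples used in this chapter.

```python
import math


def binomial_mean_sd(n, pi_1):
    """Return (mean, standard deviation) of a binomially distributed variable."""
    pi_2 = 1 - pi_1
    mean = n * pi_1                     # Equation 9.1
    sd = math.sqrt(n * pi_1 * pi_2)     # Equation 9.2
    return mean, sd


mean_10, sd_10 = binomial_mean_sd(10, 0.5)
mean_200, sd_200 = binomial_mean_sd(200, 0.5)
print(mean_10, round(sd_10, 2))     # 5.0 1.58
print(mean_200, round(sd_200, 2))   # 100.0 7.07
```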


When π₁ = π₂ = .5, the binomial distribution is symmetrical. When π₁ < .5, the 
distribution is positively skewed, with the degree of positive skew increasing as the value of π₁ 
approaches zero. When π₁ > .5, the distribution is negatively skewed, with the degree of 
negative skew increasing as the value of π₁ approaches one. The sampling distribution of a 
binomially distributed variable can be approximated by the normal distribution. The closer the 
value of π₁ is to .5 and the larger the value of n, the better the normal approximation. Because 
of the central limit theorem (which is discussed in Section VII of the single-sample z test (Test 
1)), even if the value of n is small and/or the value of π₁ is close to either 0 or 1, the normal 
distribution still provides a reasonably good approximation of the sampling distribution for a 
binomially distributed variable. 

The binomial sign test for a single sample employs the binomial distribution to determine 
the likelihood that x or more (or x or less) of n observations that comprise a sample will fall in one 
of two categories (to be designated as Category 1), if, in the underlying population, the true 
proportion of observations in Category 1 equals π₁. When there are k = 2 categories, the 
hypothesis evaluated with the binomial sign test for a single sample is identical to that evaluated 
with the chi-square goodness-of-fit test (Test 8). Since the two tests evaluate the same 
hypothesis, the hypothesis for the binomial sign test for a single sample can also be stated as 
follows: In the underlying population represented by a sample, are the observed frequencies for 
the two categories different from their expected frequencies? As noted in Section I of the chi- 
square goodness-of-fit test, the binomial sign test for a single sample is generally employed 
for small sample sizes since, when the value of n is large, the computation of exact binomial 
probabilities becomes prohibitive without access to specialized tables or the appropriate computer 
software. 


II. Examples 


Two examples will be employed to illustrate the use of the binomial sign test for a single 
sample. Since both examples employ identical data, they will result in the same conclusions with 
respect to the null hypothesis. 


Example 9.1 An experiment is conducted to determine whether a coin is biased. The coin is 
flipped ten times resulting in eight heads and two tails. Do the results indicate that the coin is 
biased? 


Example 9.2 Ten women are asked to judge which of two brands of perfume has a more 
fragrant odor. Eight of the women select Perfume A and two of the women select Perfume B. 
Is there a significant difference with respect to preference for the perfumes? 


III. Null versus Alternative Hypotheses 


Null hypothesis H₀: π₁ = .5 


(In the underlying population the sample represents, the true proportion of observations in 
Category 1 equals .5.) 


Alternative hypothesis H₁: π₁ ≠ .5 


(In the underlying population the sample represents, the true proportion of observations in 
Category 1 is not equal to .5. This is a nondirectional alternative hypothesis, and it is evaluated 
with a two-tailed test. In order to be supported, the observed proportion of observations in 
Category 1 in the sample data (which will be represented with the notation p₁) can be either 
significantly larger than the hypothesized population proportion π₁ = .5 or significantly smaller 
than π₁ = .5.) 


or 


H₁: π₁ > .5 


(In the underlying population the sample represents, the true proportion of observations in 
Category 1 is greater than .5. This is a directional alternative hypothesis, and it is evaluated 
with a one-tailed test. In order to be supported, the observed proportion of observations in 
Category 1 in the sample data must be significantly larger than the hypothesized population 
proportion π₁ = .5.) 


or 


H₁: π₁ < .5 


(In the underlying population the sample represents, the true proportion of observations in 
Category 1 is less than .5. This is a directional alternative hypothesis, and it is evaluated with 
a one-tailed test. In order to be supported, the observed proportion of observations in Category 
1 in the sample data must be significantly smaller than the hypothesized population proportion 
π₁ = .5.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 


In Example 9.1 the null and alternative hypotheses reflect the fact that it is assumed that if one 
is employing a fair coin, the probability of obtaining Heads in any trial will equal .5 (which is 
equivalent to 1/2). Thus, the expected/theoretical probability for Heads is represented by 
π₁ = .5. If the coin is fair, the probability of obtaining Tails in any trial will also equal .5, and 
thus the expected/theoretical probability for Tails is represented by π₂ = .5. Note that 
π₁ + π₂ = 1. In Example 9.2, it is assumed that if there is no difference with regard to 
preference for the two brands of perfume, the likelihood of a woman selecting Perfume A will equal 
π₁ = .5 and the likelihood of selecting Perfume B will equal π₂ = .5. In both Examples 9.1 and 
9.2 the question that is being asked is as follows: If n = 10 and π₁ = π₂ = .5, what is the 
probability of 8 or more observations in one of the two categories? Table 9.1 summarizes the 
outcome of Examples 9.1 and 9.2. The notation x is employed to represent the number of 
observations in Category 1 and the notation (n − x) is employed to represent the number of observations 
in Category 2. 


Table 9.1 Model for Binomial Sign Test for a Single Sample 
for Examples 9.1 and 9.2 


                       Category 
1 (Heads/Perfume A)    2 (Tails/Perfume B)     Total 
x = 8                  n − x = 10 − 8 = 2      n = 10 


In Examples 9.1 and 9.2, the proportion of observations in Category/Cell 1 is p₁ = 8/10 = .8 
(i.e., p₁ = n₁/n, where n₁ is the number of observations in Category 1), and the proportion of 
observations in Category/Cell 2 is p₂ = 2/10 = .2 (i.e., p₂ = n₂/n, where n₂ is the number of 
observations in Category 2). Equation 9.3 can be employed to compute the probability that 
exactly x out of a total of n observations will fall in one of the two categories. 


$$P(x) = \binom{n}{x} (\pi_1)^x (\pi_2)^{n-x}$$ (Equation 9.3) 

The term $\binom{n}{x}$ in Equation 9.3 is referred to as the binomial coefficient and is computed 
with Equation 9.4. $\binom{n}{x}$ is more generally referred to as the number of combinations of n 
things taken x at a time. 

$$\binom{n}{x} = \frac{n!}{x!\,(n - x)!}$$ (Equation 9.4) 


© 2000 by Chapman & Hall/CRC 


In the case of Examples 9.1 and 9.2, the binomial coefficient will be $\binom{10}{8}$, which is the 
combination of 10 things taken 8 at a time. In the combination expression, n = 10 represents the 
total number of coin tosses/women and x = 8 represents the observed frequency for Category 
1 (Heads/Perfume A). When the latter value (which equals $\binom{10}{8} = \frac{10!}{8!\,2!} = 45$) is multiplied by 
(.5)⁸(.5)², it yields the probability of obtaining exactly 8 Heads/Perfume A if there are 10 
observations. The probability of 8 observations in 10 trials will be represented by the notation 
P(8/10). The value P(8/10) = .0439 is computed below. 


$$P(8/10) = \binom{10}{8} (.5)^8 (.5)^2 = (45)(.5)^8 (.5)^2 = .0439$$ 
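
The computation of P(8/10) can also be carried out in code. The following is a minimal sketch of Equations 9.3 and 9.4, assuming Python 3.8 or later (for math.comb); the function name binomial_probability is an illustrative choice and does not come from the text.

```python
import math


def binomial_probability(x, n, pi_1):
    """Equation 9.3: probability of exactly x of n observations in Category 1."""
    pi_2 = 1 - pi_1
    # math.comb(n, x) is the binomial coefficient of Equation 9.4.
    return math.comb(n, x) * pi_1 ** x * pi_2 ** (n - x)


coefficient = math.comb(10, 8)
p_8_of_10 = binomial_probability(8, 10, 0.5)
print(coefficient, round(p_8_of_10, 4))   # 45 0.0439
```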


Since the computation of binomial probabilities can be quite tedious, such probabilities are 
more commonly derived through the use of tables. By employing Table A6 (Table of the 
Binomial Distribution, Individual Probabilities) in the Appendix, the value .0439 can be 
obtained without any computations. The probability value .0439 is identified by employing the 
section of Table A6 for n = 10. Within this section, the value .0439 is the entry in the cell 
that is the intersection of the row x = 8 and the column π = .5 (where π₁ = .5 is employed to 
represent the value of π). 

The probability .0439, however, does not provide enough information to allow one to 
evaluate the null hypothesis. The actual probability that is required is the likelihood of obtaining 
a value that is equal to or more extreme than the number of observations in Category 1. Thus, 
in the case of Examples 9.1 and 9.2, one must determine the probability of obtaining a frequency 
of 8 or greater for Category 1. In other words, we want to determine the likelihood of obtaining 
8, 9, or 10 Heads/Perfume A if the total number of observations is n = 10. Since we have 
already determined that the probability of obtaining 8 Heads/Perfume A is .0439, we must now 
determine the probability associated with the values 9 and 10. Although each of these 
probabilities can be computed with Equation 9.3, it is quicker to use Table A6. Employing the table, 
we determine that for π = .5 and n = 10, the probability of obtaining exactly x = 9 observations 
in Category 1 is P(9/10) = .0098, and the probability of obtaining exactly x = 10 observations is 
P(10/10) = .0010. The sum of the three probabilities P(8/10), P(9/10), and P(10/10) represents 
the likelihood of obtaining 8 or more Heads/Perfume A in 10 observations. Thus: P(8, 9, or 
10/10) = .0439 + .0098 + .0010 = .0547. Equation 9.5 summarizes the computation of a 
cumulative probability such as that represented by P(8, 9, or 10/10) = .0547. 


$$P(\geq x) = \sum_{r=x}^{n} \binom{n}{r} (\pi_1)^r (\pi_2)^{n-r}$$ (Equation 9.5) 


Where: $\sum_{r=x}^{n}$ indicates that probability values should be summed beginning with the 
designated value of x up through the value n 
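
The cumulative probability of Equation 9.5 can be sketched as a sum of individual binomial probabilities. In the example below the function name upper_tail is a hypothetical helper introduced for illustration, and Python 3.8 or later is assumed for math.comb.

```python
import math


def upper_tail(x, n, pi_1):
    """Equation 9.5: probability of x or more of n observations in Category 1."""
    pi_2 = 1 - pi_1
    return sum(
        math.comb(n, r) * pi_1 ** r * pi_2 ** (n - r)
        for r in range(x, n + 1)
    )


p_8_or_more = upper_tail(8, 10, 0.5)
print(round(p_8_or_more, 4))   # 0.0547
```

The result agrees with the sum .0439 + .0098 + .0010 = .0547 obtained from the individual probabilities.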


An even more efficient way of obtaining the probability P(8, 9, or 10/10) = .0547 is to 
employ Table A7 (Table of the Binomial Distribution, Cumulative Probabilities) in the 
Appendix. When employing Table A7 we again find the section for n = 10, and locate the cell 
that is the intersection of the row x = 8 and the column π = .5. The entry .0547 in that cell 
represents the probability of 8 or more (i.e., 8, 9, or 10) Heads/Perfume A, if there is a total of 
n = 10 observations. Thus, the entry in any cell of Table A7 represents (for the appropriate value 
of π) the probability of obtaining a number of observations that is equal to or greater than the 
value of x in the left margin of the row in which the cell is located. 

Table A7 can also be used to determine the likelihood of x being equal to or less than a specific 
value. In such a case, the cumulative probability associated with the value of (x + 1) is subtracted 
from 1. To illustrate this, let us assume that in Examples 9.1 and 9.2 we want to determine the 
probability of obtaining 2 or fewer observations in one of the two categories (which applies to 
Category 2). In such an instance the value x = 2 is employed, and thus, x + 1 = 3. The 
cumulative probability associated with x = 3 (for π = π₂ = .5) is .9453. If the latter value is 
subtracted from 1 it yields 1 − .9453 = .0547, which represents the likelihood of obtaining 2 or 
fewer observations in a cell when n = 10. The value .0547 can also be obtained from Table A6 
by adding up the probabilities for the values x = 2 (.0439), x = 1 (.0098), and x = 0 (.0010). 
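
The complement rule described above (subtract the cumulative probability for x + 1 from 1) can be checked against a direct sum of the lower-tail probabilities. This is a sketch under the same assumptions as the rest of the chapter's examples (n = 10, π = .5); the helper names are illustrative and are not taken from the text.

```python
import math


def pmf(x, n, pi):
    """Probability of exactly x of n observations in a category with probability pi."""
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)


def upper_tail(x, n, pi):
    """Probability of x or more observations (the type of entry listed in Table A7)."""
    return sum(pmf(r, n, pi) for r in range(x, n + 1))


n, pi, x = 10, 0.5, 2
cumulative_for_3 = upper_tail(x + 1, n, pi)           # probability of 3 or more
via_complement = 1 - cumulative_for_3                 # probability of 2 or fewer
direct_sum = sum(pmf(r, n, pi) for r in range(0, x + 1))

print(round(cumulative_for_3, 4))                     # 0.9453
print(round(via_complement, 4), round(direct_sum, 4)) # 0.0547 0.0547
```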

It should be noted that none of the values listed for π in Tables A6 and A7 exceeds .5. To 
employ the tables when the value for π₁ stated in the null hypothesis is greater than .5, the 
following protocol is employed: a) Use the value of π₂ (i.e., π₂ = 1 − π₁) to represent the value 
of π; and b) Each of the values of x is subtracted from the value of n, and the resulting values 
are employed to represent x in using the table for the analysis. To illustrate, let us assume that 
n = 10, π₁ = .7, and x = 9, and that we wish to determine the probability that there are 9 or more 
observations in one of the categories. Employing the above guidelines, the tabled value to use 
for π is π₂ = .3 (since 1 − .7 = .3). Since each value of x is subtracted from n, the values x = 9 
and x = 10 are respectively converted into x = 1 and x = 0. In Table A6 (for π = .3) the 
probabilities associated with x = 1 and x = 0 will respectively represent those probabilities 
associated with x = 9 and x = 10. The sum of the tabled probabilities for x = 1 and 
x = 0 represents the likelihood that there will be 9 or more observations in one of the categories. 
From Table A6 we determine that for π = .3, P(1/10) = .1211 and P(0/10) = .0282. Thus, 
P(0 or 1/10) = .1211 + .0282 = .1493 (which also represents P(9 or 10/10) when π₁ = .7). The 
value .1493 can also be obtained from Table A7 by subtracting the tabled probability value 
for (x + 1) from 1 (make sure that in computing (x + 1), the value of x that results from 
subtracting the original value of x from n is employed). Thus, if the converted value of x = 1, 
then x + 1 = 2. The tabled value in Table A7 for x = 2 and π = .3 is .8507. When the latter value 
is subtracted from 1, it yields .1493. 
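
The conversion protocol for values of π₁ greater than .5 rests on a symmetry of the binomial distribution: the probability of x observations when the category probability is π₁ equals the probability of n − x observations when it is 1 − π₁. The sketch below, in which the helper name pmf is an assumption made for illustration, verifies this for the n = 10, π₁ = .7 case discussed above.

```python
import math


def pmf(x, n, pi):
    """Probability of exactly x of n observations in a category with probability pi."""
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)


n, pi_1 = 10, 0.7
pi_2 = 1 - pi_1

# Direct computation: 9 or more observations when pi_1 = .7.
direct = pmf(9, n, pi_1) + pmf(10, n, pi_1)

# Table protocol: look up n - x (i.e., 1 and 0) under pi_2 = .3 instead.
p_1_of_10 = pmf(n - 9, n, pi_2)
p_0_of_10 = pmf(n - 10, n, pi_2)
converted = p_1_of_10 + p_0_of_10

print(round(p_1_of_10, 4), round(p_0_of_10, 4))   # 0.1211 0.0282
print(round(direct, 4), round(converted, 4))      # 0.1493 0.1493
```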


V. Interpretation of the Test Results 


When the binomial sign test for a single sample is applied to Examples 9.1 and 9.2, it provides 
a probabilistic answer to the question of whether or not p₁ = 8/10 = .8 (i.e., the observed 
proportion of cases for Category 1) deviates significantly from the value π₁ = .5 stated in the 
null hypothesis. The following guidelines are employed in evaluating the null hypothesis. 

a) If a nondirectional alternative hypothesis is employed, the null hypothesis can be rejected 
if the probability of obtaining a value equal to or more extreme than x is equal to or less than α/2 
(where α represents the prespecified value of alpha). If the proportion p₁ = x/n is greater than π₁, 
a value that is more extreme than x will be any value that is greater than the observed value of 
x, whereas if the proportion p₁ = x/n is less than π₁, a value that is more extreme than x will be 
any value that is less than the observed value of x. 

b) If a directional alternative hypothesis is employed that predicts the underlying population 
proportion is above a specified value, to reject the null hypothesis both of the following 
conditions must be met: 1) The proportion of cases observed in Category 1 (p₁) must be greater than 
the value of π₁ stipulated in the null hypothesis; and 2) The probability of obtaining a value 
equal to or greater than x is equal to or less than the prespecified value of α. 

c) If a directional alternative hypothesis is employed that predicts the underlying population 
proportion is below a specified value, to reject the null hypothesis both of the following 
conditions must be met: 1) The proportion of cases observed in Category 1 (p₁) must be less than 
the value of π₁ stipulated in the null hypothesis; and 2) The probability of obtaining a value equal 
to or less than x is equal to or less than the prespecified value of α. 

Applying the above guidelines to the results of the analysis of Examples 9.1 and 9.2, we 
can conclude the following. 

If α = .05, the nondirectional alternative hypothesis H₁: π₁ ≠ .5 is not supported, since the 
obtained probability .0547 is greater than α/2 = .05/2 = .025. In the same respect, if α = .01, the 
nondirectional alternative hypothesis H₁: π₁ ≠ .5 is not supported, since the obtained probability 
.0547 is greater than α/2 = .01/2 = .005. 

If α = .05, the directional alternative hypothesis H₁: π₁ > .5 is not supported, since the 
obtained probability .0547 is greater than α = .05. In the same respect, if α = .01, the directional 
alternative hypothesis H₁: π₁ > .5 is not supported, since the obtained probability .0547 is 
greater than α = .01. 

The directional alternative hypothesis H₁: π₁ < .5 is not supported, since p₁ = .8 is 
larger than the value π₁ = .5 stated in the null hypothesis. If the alternative hypothesis 
H₁: π₁ < .5 is employed and the sample data are consistent with it, in order to be supported the 
obtained probability must be equal to or less than the prespecified value of alpha. 

To summarize, the results of the analysis of Examples 9.1 and 9.2 do not allow a researcher 
to conclude that the true population proportion is some value other than .5. In view of this, in 
Example 9.1 the data do not allow one to conclude that the coin is biased. In Example 9.2, the 
data do not allow one to conclude that women exhibit a preference for one of the two brands of 
perfume. 

It should be noted that if the obtained proportion p₁ = .8 had been obtained with a larger 
sample size, the null hypothesis could be rejected. To illustrate, if π₁ = .5, n = 15, and x = 12 
(and thus p₁ = .8), the likelihood of obtaining 12 or more observations in one of the two 
categories is .0176. The latter value is significant at the .05 level if the directional alternative 
hypothesis H₁: π₁ > .5 is employed, since it is less than the value α = .05. It is also significant 
at the .05 level if the nondirectional alternative hypothesis H₁: π₁ ≠ .5 is employed, since .0176 
is less than α/2 = .05/2 = .025. 
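
The decision guidelines a) through c) in this section can be expressed as a short routine. The following is a sketch only: the function names, and the labels "nondirectional", "greater" and "less" used to select the alternative hypothesis, are assumptions introduced for illustration and do not appear in the text.

```python
import math


def pmf(x, n, pi):
    """Probability of exactly x of n observations in Category 1."""
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)


def upper_tail(x, n, pi):
    return sum(pmf(r, n, pi) for r in range(x, n + 1))


def lower_tail(x, n, pi):
    return sum(pmf(r, n, pi) for r in range(0, x + 1))


def binomial_sign_test(x, n, pi_1, alternative, alpha):
    """Return (probability, reject) following guidelines a) through c)."""
    p_1 = x / n
    if alternative == "greater":
        prob = upper_tail(x, n, pi_1)
        return prob, p_1 > pi_1 and prob <= alpha
    if alternative == "less":
        prob = lower_tail(x, n, pi_1)
        return prob, p_1 < pi_1 and prob <= alpha
    # Nondirectional: use the tail on the side of the observed proportion
    # and compare it with alpha / 2.
    prob = upper_tail(x, n, pi_1) if p_1 >= pi_1 else lower_tail(x, n, pi_1)
    return prob, prob <= alpha / 2


for x, n in ((8, 10), (12, 15)):
    for alternative in ("nondirectional", "greater"):
        prob, reject = binomial_sign_test(x, n, 0.5, alternative, 0.05)
        print(x, n, alternative, round(prob, 4), reject)
```

With n = 10 and x = 8 the probability .0547 does not lead to rejection under either alternative, whereas with n = 15 and x = 12 the probability .0176 leads to rejection under both, which matches the conclusions stated above.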


VI. Additional Analytical Procedures for the Binomial Sign Test 
for a Single Sample and/or Related Tests 


1. Test 9a: The z test for a population proportion When the size of the sample is large the 
test statistic for the binomial sign test for a single sample can be approximated with the 
chi-square distribution (specifically, through use of the chi-square goodness-of-fit test). An 
alternative and equivalent approximation can be obtained by using the normal distribution. 
When the latter distribution is employed to approximate the test statistic for the binomial sign 
test for a single sample, the test is referred to as the z test for a population proportion. The 
null and alternative hypotheses employed for the z test for a population proportion are identical 
to those employed for the binomial sign test for a single sample. 

Although sources are not in agreement with respect to the minimum acceptable sample size 
for use with the z test for a population proportion, there is general agreement that the closer the 
value of π₁ (or π₂) is to either 0 or 1 (i.e., the further removed it is from .5), the larger the sample 
size required for an accurate normal approximation. Among those sources that make 
recommendations with respect to the minimum acceptable sample size (regardless of the value of π₁) 
are Freund (1984) and Marascuilo and McSweeney (1977) who state that the values of both nπ₁ 
and nπ₂ should be greater than 5. Daniel (1990) states that n should be at least equal to 12. Siegel 
and Castellan (1988), on the other hand, note that when π₁ is close to .5, the test can be 
employed when n > 25, but when π₁ is close to 1 or 0, the value nπ₁π₂ should be greater than 
9. In view of the different criteria stipulated in various sources, one should employ common 
sense in interpreting results for the normal approximation based on small sample sizes, 
especially when the values of π₁ or π₂ are close to 0 or 1. Since, when the sample size is small 
the normal approximation tends to inflate the Type I error rate, the error rate can be adjusted by 
conducting a more conservative test (i.e., employ a lower alpha level). A more practical 
alternative, however, is to use a test statistic that is corrected for continuity. As will be 
demonstrated in the discussion to follow, when the correction for continuity is employed for the 
z test for a population proportion, the test statistic will generally provide an excellent 
approximation of the binomial distribution even when the size of the sample is small and/or the 
values of π₁ and π₂ are far removed from .5. 

Examples 9.3-9.5 will be employed to illustrate the use of the z test for a population 
proportion. Since the three examples use identical data they will result in the same conclusions 
with respect to the null hypothesis. It will also be demonstrated that when the chi-square 
goodness-of-fit test is applied to Examples 9.3-9.5 it yields equivalent results. 


Example 9.3 An experiment is conducted to determine if a coin is biased. The coin is flipped 
200 times resulting in 96 heads and 104 tails. Do the results indicate that the coin is biased? 


Example 9.4 Although a senator supports a bill which favors a woman's right to have an abor- 
tion, she realizes her vote could influence whether or not the people in her state endorse her bid 
for reelection. In view of this she decides that she will not vote in favor of the bill unless at least 
50% of her constituents support a woman's right to have an abortion. A random survey of 200 
voters in her district reveals that 96 people are in favor of abortion. Will the senator support the 
bill? 


Example 9.5 In order to determine whether or not a subject exhibits extrasensory ability, a 
researcher employs a list of 200 binary digits (specifically, the values 0 and 1) which have been 
randomly generated by a computer. The researcher conducts an experiment in which one of his 
assistants concentrates on each of the digits in the order it appears on the list. While the assis- 
tant does this, the subject, who is in another room, attempts to guess the value of the number for 
each of the 200 trials. The subject correctly guesses 96 of the 200 digits. Does the subject 
exhibit evidence of extrasensory ability? 


As is the case for Examples 9.1 and 9.2, Examples 9.3-9.5 are evaluating the hypothesis 
of whether or not the true population proportion is .5. Thus, the null hypothesis and the 
nondirectional alternative hypothesis are: H₀: π₁ = .5 versus H₁: π₁ ≠ .5. 

The test statistic for the z test for a population proportion is computed with Equation 9.6. 


$$z = \frac{p_1 - \pi_1}{\sqrt{\dfrac{\pi_1 \pi_2}{n}}}$$ (Equation 9.6) 


The denominator of Equation 9.6, √(π₁π₂/n), which is the standard deviation of the 
sampling distribution of a proportion, is commonly referred to as the standard error of the 
proportion. 

For Examples 9.3-9.5, based on the null hypothesis we know that π₁ = .5 and 
π₂ = 1 − π₁ = .5. From the information that has been provided, we can compute the 
following values: p₁ = 96/200 = .48; p₂ = (200 − 96)/200 = 104/200 = .52. When the relevant 
values are substituted in Equation 9.6, the value z = −.57 is computed. 


$$z = \frac{.48 - .50}{\sqrt{\dfrac{(.5)(.5)}{200}}} = -.57$$ 


Equation 9.7 is an alternative form of Equation 9.6 that will yield the identical z value. 


$$z = \frac{x - n\pi_1}{\sqrt{n\pi_1\pi_2}}$$ (Equation 9.7) 


In Section I it is noted that the mean and standard deviation of a binomially distributed
variable are respectively $\mu = n\pi_1$ and $\sigma = \sqrt{n \pi_1 \pi_2}$. These values represent
the mean and standard deviation of the underlying sampling distribution. In the numerator of
Equation 9.7, the value $\mu = n\pi_1$ represents the expected number of observations in Category 1
if, in fact, the population proportion is equal to $\pi_1 = .5$ (i.e., the value stipulated in the
null hypothesis). Thus, for the examples under discussion, $\mu = (200)(.5) = 100$. Note that the
latter expected value is subtracted from $x$, the number of observations in Category 1. The
denominator of Equation 9.7 is the standard deviation of a binomially distributed variable. Thus,
in the case of Examples 9.3-9.5, $\sigma = \sqrt{(200)(.5)(.5)} = 7.07$. Employing Equation 9.7, the
value $z = -.57$ (which is identical to the value computed with Equation 9.6) is obtained.


\[ z = \frac{96 - (200)(.5)}{\sqrt{(200)(.5)(.5)}} = -.57 \]
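The equivalence of Equations 9.6 and 9.7 can be checked numerically. The following is a minimal sketch in Python; the function names `z_from_proportion` and `z_from_count` are illustrative choices, not notation from the text.

```python
import math


def z_from_proportion(p1, pi1, n):
    """Equation 9.6: z computed from the observed proportion p1."""
    pi2 = 1.0 - pi1
    return (p1 - pi1) / math.sqrt(pi1 * pi2 / n)


def z_from_count(x, pi1, n):
    """Equation 9.7: z computed from the observed count x in Category 1."""
    pi2 = 1.0 - pi1
    return (x - n * pi1) / math.sqrt(n * pi1 * pi2)


# Examples 9.3-9.5: 96 observations in Category 1 out of n = 200, with pi1 = .5
z_a = z_from_proportion(96 / 200, 0.5, 200)
z_b = z_from_count(96, 0.5, 200)
print(round(z_a, 2), round(z_b, 2))  # -0.57 -0.57
```

Both forms divide the same deviation by the same standard error, expressed once in units of proportion and once in units of frequency, which is why the two values agree.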


The obtained value $z = -.57$ is evaluated with Table A1 (Table of the Normal Distribution)
in the Appendix. In Table A1 the tabled critical two-tailed .05 and .01 values are
$z_{.05} = 1.96$ and $z_{.01} = 2.58$, and the tabled critical one-tailed .05 and .01 values are
$z_{.05} = 1.65$ and $z_{.01} = 2.33$.

The following guidelines are employed in evaluating the null hypothesis. 

a) If the alternative hypothesis employed is nondirectional, the null hypothesis can be 
rejected if the obtained absolute value of z is equal to or greater than the tabled critical two-tailed 
value at the prespecified level of significance. 

b) If the alternative hypothesis employed is directional and predicts a population proportion 
larger than the value stated in the null hypothesis, the null hypothesis can be rejected if the sign 
of z is positive and the value of z is equal to or greater than the tabled critical one-tailed value at 
the prespecified level of significance. 

c) If the alternative hypothesis employed is directional and predicts a population proportion 
smaller than the value stated in the null hypothesis, the null hypothesis can be rejected if the sign 
of z is negative and the absolute value of z is equal to or greater than the tabled critical one-tailed 
value at the prespecified level of significance. 

Using the above guidelines, the null hypothesis cannot be rejected regardless of which of
the three possible alternative hypotheses is employed. The nondirectional alternative hypothesis
$H_1: \pi_1 \neq .5$ is not supported, since the absolute value z = .57 is less than the tabled
critical two-tailed value $z_{.05} = 1.96$. The directional alternative hypothesis
$H_1: \pi_1 > .5$ is not supported, since to be supported, the sign of z must be positive. The
directional alternative hypothesis $H_1: \pi_1 < .5$ is not supported, since although the sign of z
is negative as predicted, the absolute value z = .57 is less than the tabled critical one-tailed
value $z_{.05} = 1.65$.
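
Guidelines a) through c) amount to a small decision rule. The sketch below is one way to express it in Python; the function name `reject_null` and the labels "nondirectional", "greater", and "less" are assumptions made for illustration, and the default critical values are the tabled .05 values quoted above.

```python
def reject_null(z, alternative, z_two_tailed=1.96, z_one_tailed=1.65):
    """Return True if the null hypothesis can be rejected for the given alternative."""
    if alternative == "nondirectional":
        # Guideline a): compare |z| with the two-tailed critical value
        return abs(z) >= z_two_tailed
    if alternative == "greater":
        # Guideline b): z must be positive and reach the one-tailed critical value
        return z > 0 and z >= z_one_tailed
    if alternative == "less":
        # Guideline c): z must be negative and |z| must reach the one-tailed critical value
        return z < 0 and abs(z) >= z_one_tailed
    raise ValueError("alternative must be 'nondirectional', 'greater', or 'less'")


# Examples 9.3-9.5: z = -.57
decisions = {alt: reject_null(-0.57, alt) for alt in ("nondirectional", "greater", "less")}
print(decisions)  # {'nondirectional': False, 'greater': False, 'less': False}
```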

It was noted previously that when there are k = 2 categories, the chi-square goodness-of-fit
test also provides a large sample approximation of the test statistic for the binomial sign test for
a single sample. In point of fact, the large sample approximation based on the chi-square
goodness-of-fit test will yield results that are equivalent to those obtained with the z test for a
population proportion, and the relationship between the computed chi-square value and the
obtained z value for the same set of data will always be $\chi^2 = z^2$. Table 9.2 summarizes the
results of the analysis of Examples 9.3-9.5 with the chi-square goodness-of-fit test, which, as
noted earlier, evaluates the same hypothesis as the binomial sign test for a single sample. The
null hypothesis and nondirectional alternative hypothesis when used in reference to the
chi-square goodness-of-fit test can also be stated employing the following format:
$H_0: o_i = \varepsilon_i$ for both cells versus $H_1: o_i \neq \varepsilon_i$ for both cells.


Table 9.2 Chi-Square Summary Table for Examples 9.3-9.5

Cell                                     $O_i$   $E_i$   $(O_i - E_i)$   $(O_i - E_i)^2$   $(O_i - E_i)^2 / E_i$
Heads/Pro-Abortion/Correct Guesses         96     100         -4               16                 .16
Tails/Anti-Abortion/Incorrect Guesses     104     100          4               16                 .16

$\Sigma O_i = 200$     $\Sigma E_i = 200$     $\Sigma (O_i - E_i) = 0$     $\chi^2 = .32$


In Table 9.2, the expected frequency of each cell is computed by multiplying the hypothesized
population proportion for the cell by n = 200 (i.e., employing Equation 8.1,
$E_i = (n)(\pi_i) = (200)(.5) = 100$). Since k = 2, the degrees of freedom employed for the
chi-square analysis are $df = k - 1 = 1$. The value $\chi^2 = .32$ (which is obtained with
Equation 8.2) is evaluated with Table A4 (Table of the Chi-Square Distribution) in the Appendix.
For df = 1, the tabled critical .05 and .01 chi-square values are $\chi^2_{.05} = 3.84$ and
$\chi^2_{.01} = 6.63$. Since the obtained value $\chi^2 = .32$ is less than $\chi^2_{.05} = 3.84$,
the null hypothesis cannot be rejected if the nondirectional alternative hypothesis
$H_1: \pi_1 \neq .5$ is employed. If the directional alternative hypothesis $H_1: \pi_1 < .5$ is
employed it is not supported, since $\chi^2 = .32$ is less than the tabled critical one-tailed .05
value $\chi^2_{.05} = 2.71$ (which corresponds to the chi-square value at the 90th percentile).

As noted previously, if the z value obtained with Equations 9.6 and 9.7 is squared, it will
always equal the chi-square value computed for the same data. Thus, in the current example
where $z = -.57$ and $\chi^2 = .32$: $(-.57)^2 = .32$. (The minimal discrepancy is the result of
rounding off error.) It is also the case that the square of a tabled critical z value at a given level
of significance will equal the tabled critical chi-square value at the corresponding level of
significance. This is confirmed for the tabled critical two-tailed z and $\chi^2$ values at the .05
and .01 levels of significance: $(z_{.05} = 1.96)^2 = (\chi^2_{.05} = 3.84)$ and
$(z_{.01} = 2.58)^2 = (\chi^2_{.01} = 6.63)$.
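
The identity between the chi-square value and the squared z value can be verified directly from the cell frequencies. The following Python sketch recomputes both quantities for Examples 9.3-9.5; the variable names are illustrative.

```python
import math

n, pi1 = 200, 0.5
observed = [96, 104]
expected = [n * pi1, n * (1 - pi1)]  # 100 in each cell

# Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E over the k = 2 cells
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# z computed from the Category 1 count, carried at full precision
z = (observed[0] - n * pi1) / math.sqrt(n * pi1 * (1 - pi1))

print(round(chi_square, 2), round(z ** 2, 2))  # 0.32 0.32
```

When z is carried at full precision rather than rounded to -.57, its square matches the chi-square value, which is consistent with attributing the small discrepancy in the hand computation to rounding.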

To summarize, the results of the analysis for Examples 9.3—9.5 do not allow one to 
conclude that the true population proportion is some value other than .5. In view of this, in 
Example 9.3 the data do not allow one to conclude that the coin is biased. In Example 9.4 the 
data do not allow the senator to conclude that the true proportion of the population that favors 
abortion is some value other than .5. In Example 9.5 the data do not allow one to conclude that 
the subject exhibited extrasensory abilities. 

It was noted previously that when the z test for a population proportion is employed with
small sample sizes, it tends to inflate the likelihood of committing a Type I error. This is
illustrated below with the data for Examples 9.1 and 9.2 (8 observations in Category 1 out of
n = 10), which are evaluated with Equation 9.6.

\[ z = \frac{.8 - .5}{\sqrt{\dfrac{(.5)(.5)}{10}}} = 1.90 \]

Since the obtained value z = 1.90 is greater than the tabled critical one-tailed value
$z_{.05} = 1.65$, the directional alternative hypothesis $H_1: \pi_1 > .5$ is supported at the .05
level. When the binomial sign test for a single sample is employed to evaluate the latter
alternative hypothesis for the same data, the result falls just short of being significant at the
.05 level. The nondirectional alternative hypothesis $H_1: \pi_1 \neq .5$, which is not even close
to being supported with the binomial sign test for a single sample, falls just short of being
supported at the .05 level when the z test for a population proportion is employed (since the
tabled critical two-tailed .05 value is $z_{.05} = 1.96$). When the conclusions reached with respect
to Examples 9.1 and 9.2 employing the binomial sign test for a single sample and the z test for a
population proportion are compared with one another, it can be seen that the z test for a
population proportion is the less conservative of the two tests (i.e., it is more likely to reject
the null hypothesis).


The correction for continuity for the z test for a population proportion It is noted in the
discussions of the Wilcoxon signed-ranks test (Test 6) and the chi-square goodness-of-fit test 
that many sources recommend that a correction for continuity be employed when a continuous 
distribution is employed to estimate a discrete probability distribution. Most sources recommend 
the latter correction when the normal distribution is used to approximate the binomial 
distribution, since the correction will adjust the Type I error rate (which will generally be inflated 
when the normal approximation is employed with small sample sizes). Equations 9.8 and 9.9 are, 
respectively, the continuity-corrected versions of Equations 9.6 and 9.7. Each of the continuity- 
corrected equations is applied to the data for Examples 9.3-9.5. 


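\[ z = \frac{|p_1 - \pi_1| - \dfrac{1}{2n}}{\sqrt{\dfrac{\pi_1 \pi_2}{n}}} = \frac{|.48 - .5| - \dfrac{1}{(2)(200)}}{\sqrt{\dfrac{(.5)(.5)}{200}}} = .49 \qquad \text{(Equation 9.8)} \]

\[ z = \frac{|x - n\pi_1| - .5}{\sqrt{n \pi_1 \pi_2}} = \frac{|96 - 100| - .5}{\sqrt{(200)(.5)(.5)}} = .49 \qquad \text{(Equation 9.9)} \]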


As is the case when Equations 9.6 and 9.7 are employed to compute the absolute value
z = .57, the absolute value z = .49 computed with Equations 9.8 and 9.9 is less than the tabled
critical two-tailed .05 value $z_{.05} = 1.96$ and the tabled critical one-tailed .05 value
$z_{.05} = 1.65$. Thus, regardless of which of the possible alternative hypotheses one employs, the
null hypothesis cannot be rejected. Note that the continuity-corrected absolute value z = .49 is
less than the absolute value z = .57 obtained with Equations 9.6 and 9.7. Since a
continuity-corrected equation will always result in a lower absolute value for z, it will provide
a more conservative test of the null hypothesis. The smaller the sample size, the greater the
difference between the values computed with the continuity-corrected and uncorrected equations.
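
The continuity-corrected equations differ from Equations 9.6 and 9.7 only in the numerator. The Python sketch below applies both corrected forms to Examples 9.3-9.5; the function names are illustrative and not taken from the text.

```python
import math


def z_corrected_from_proportion(p1, pi1, n):
    """Equation 9.8: subtract 1/(2n) from the absolute deviation in proportion units."""
    pi2 = 1.0 - pi1
    return (abs(p1 - pi1) - 1.0 / (2 * n)) / math.sqrt(pi1 * pi2 / n)


def z_corrected_from_count(x, pi1, n):
    """Equation 9.9: subtract .5 from the absolute deviation in frequency units."""
    pi2 = 1.0 - pi1
    return (abs(x - n * pi1) - 0.5) / math.sqrt(n * pi1 * pi2)


z_98 = z_corrected_from_proportion(96 / 200, 0.5, 200)
z_99 = z_corrected_from_count(96, 0.5, 200)
print(round(z_98, 2), round(z_99, 2))  # 0.49 0.49
```

Because the correction is a fixed half unit of frequency, its share of the deviation shrinks as n grows, which is one way to see why the corrected and uncorrected values differ most for small samples.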

Equation 8.6, which as noted previously is the continuity-corrected equation for the 
chi-square goodness-of-fit test, can also be employed with the same data and will yield an 
equivalent result to that obtained with Equations 9.8 and 9.9. When employing Equation 8.6 there 
are two cells and each cell has an expected frequency of 100. The observed frequencies of the
two cells are 96 and 104. Thus for each cell, $(|O_i - E_i| - .5) = 3.5$. Thus:


\[ \chi^2 = \sum_{i=1}^{k} \frac{(|O_i - E_i| - .5)^2}{E_i} = \frac{(3.5)^2}{100} + \frac{(3.5)^2}{100} = .245 \]


Note that $(.49)^2 \approx .245$ (once again the slight discrepancy is due to rounding off error).
As is the case with $\chi^2 = .32$ (the uncorrected chi-square value computed in Table 9.2), the
continuity-corrected value $\chi^2 = .245$ is not significant, since it is less than the tabled
critical .05 value $\chi^2_{.05} = 3.84$ (in reference to the nondirectional alternative hypothesis).
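
The continuity-corrected chi-square value can be reproduced in a few lines. The Python sketch below applies the correction described for Equation 8.6 to the two cells of Examples 9.3-9.5; the variable names are illustrative.

```python
observed = [96, 104]
expected = [100.0, 100.0]

# Continuity-corrected chi-square: reduce each absolute deviation by .5 before squaring
chi_square_corrected = sum(
    (abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected)
)

# Square of the continuity-corrected z (Equation 9.9) at full precision: (3.5)^2 / 50
z_corrected_squared = (abs(96 - 100) - 0.5) ** 2 / (200 * 0.5 * 0.5)

print(round(chi_square_corrected, 3), round(z_corrected_squared, 3))  # 0.245 0.245
```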

Although in the case of Examples 9.3—9.5 the use of the correction for continuity does not 
alter the decision one can make with respect to the null hypothesis, this will not always be the 
case. To illustrate this, Equation 9.8 is employed below to compute the continuity-corrected 
value of z for Examples 9.1 and 9.2. 


\[ z = \frac{|.8 - .5| - \dfrac{1}{(2)(10)}}{\sqrt{\dfrac{(.5)(.5)}{10}}} = 1.58 \]

Since the continuity-corrected value z = 1.58 is less than the tabled critical one-tailed
value $z_{.05} = 1.65$, the directional alternative hypothesis $H_1: \pi_1 > .5$ is not supported.
This is consistent with the result that is obtained when the binomial sign test for a single
sample is employed. Recall that when the data are evaluated with Equation 9.6 (i.e., without the
continuity correction) the directional alternative hypothesis $H_1: \pi_1 > .5$ is supported. Thus,
it appears that in this instance the continuity correction yields a result that is more consistent
with the result based on the exact binomial probability.
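
The three approaches can be placed side by side for the small-sample case of Examples 9.1 and 9.2 (8 of 10 observations in Category 1). The following Python sketch is illustrative only; it computes the uncorrected z, the continuity-corrected z, and the exact one-tailed binomial probability of 8 or more observations.

```python
import math

n, x, pi1 = 10, 8, 0.5
standard_error = math.sqrt(pi1 * (1 - pi1) / n)

# Equation 9.6 (uncorrected) and Equation 9.8 (continuity-corrected)
z_uncorrected = (x / n - pi1) / standard_error
z_corrected = (abs(x / n - pi1) - 1 / (2 * n)) / standard_error

# Exact binomial probability of x or more observations when pi1 = .5
p_exact = sum(math.comb(n, k) for k in range(x, n + 1)) * pi1 ** n

print(round(z_uncorrected, 2), round(z_corrected, 2), round(p_exact, 4))  # 1.9 1.58 0.0547
```

Only the uncorrected value exceeds the one-tailed critical value 1.65; the corrected value and the exact probability (which is slightly above .05) both lead to retaining the null hypothesis.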

Sources are not in agreement with respect to whether or not a correction for continuity 
should be employed. Zar (1999) cites a study by Ramsey and Ramsey (1988) which found that 
the results with the correction for continuity are overly conservative (i.e., more likely to retain 
the null hypothesis when it should be rejected). 


Computation of a confidence interval for the z test for a population proportion Equation 
8.5, which is described in the discussion of the chi-square goodness-of-fit test, can also be 
employed for computing a confidence interval for the z test for a population proportion. 
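Equation 8.5 is employed below to compute the 95% confidence interval for Examples 9.3-9.5
for Category 1.

\[ p_1 - z_{(\alpha/2)} \sqrt{\frac{p_1 p_2}{n}} \leq \pi_1 \leq p_1 + z_{(\alpha/2)} \sqrt{\frac{p_1 p_2}{n}} \]

\[ .48 - (1.96) \sqrt{\frac{(.48)(.52)}{200}} \leq \pi_1 \leq .48 + (1.96) \sqrt{\frac{(.48)(.52)}{200}} \]

\[ \pi_1 = .48 \pm .069 \]

\[ .411 \leq \pi_1 \leq .549 \]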

Thus, the researcher can be 95% confident (or the probability is .95) that the true proportion
of cases in the underlying population in Category 1 is a value between .411 and .549. The
confidence interval for the population proportion for Category 2 (i.e., $\pi_2$) can be obtained by
adding and subtracting the value .069 to and from the value $p_2 = .52$. Thus,
$.451 \leq \pi_2 \leq .589$.

Alternative procedures for computing a confidence interval for a binomially distributed
variable are described in Zar (1999, pp. 527-530).
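
The confidence interval computation can be sketched as follows. This Python example implements the normal-approximation interval described above; the function name `proportion_confidence_interval` is an illustrative choice, and the default critical value 1.96 is the tabled two-tailed .05 value.

```python
import math


def proportion_confidence_interval(p, n, z_critical=1.96):
    """Normal-approximation interval: p plus or minus z * sqrt(p(1 - p) / n)."""
    half_width = z_critical * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


lower_1, upper_1 = proportion_confidence_interval(0.48, 200)  # Category 1
lower_2, upper_2 = proportion_confidence_interval(0.52, 200)  # Category 2
print(round(lower_1, 3), round(upper_1, 3))  # 0.411 0.549
print(round(lower_2, 3), round(upper_2, 3))  # 0.451 0.589
```

Because $p_1 p_2$ is the same product for both categories, the half-width .069 is shared and the two intervals are mirror images around .5.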


Extension of z test for a population proportion to evaluate the performance of m subjects 
on n trials on a binomially distributed variable Example 9.6 illustrates a case in which 
each of m subjects is evaluated for a total of n trials on a binomially distributed variable. The 
example represents an extension of the analysis used for Example 9.5 to a design involving m 
subjects. The methodology employed in analyzing the data is basically an extension of the 
single-sample z test to an analysis of a population proportion that is based on a binomially 
distributed variable. 


Example 9.6 In order to determine whether or not a group of 10 people exhibit extrasensory 
ability, a researcher employs as test stimuli a list of 200 binary digits (specifically, the values 0 
and 1) which have been randomly generated by a computer. The researcher conducts an 
experiment in which one of his assistants concentrates on each of the digits in the order it 
appears on the list. While the assistant does this, each of the 10 subjects, all of whom are in 
separate rooms, attempts to guess the value of the number for each of 200 trials. The number 
of correct guesses in 200 trials for each of the subjects follows: 102, 104, 100, 98, 96, 80, 110, 
120, 102, 128. Does the group as a whole exhibit evidence of extrasensory abilities? 


Equation 9.10 is employed to evaluate Example 9.6. 


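\[ z = \frac{\bar{X} - \mu}{\dfrac{\sigma}{\sqrt{m}}} \qquad \text{(Equation 9.10)} \]

Where: m represents the number of subjects in the sample
       $\mu = n\pi_1$
       $\sigma = \sqrt{n \pi_1 \pi_2}$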


The basic structure of Equation 9.10 is the same as that of Equation 1.3
($z = (\bar{X} - \mu)/\sigma_{\bar{X}}$), which is the equation for the single-sample z test. In
the numerator of Equation 1.3, the hypothesized population mean is subtracted from the sample
mean. Equation 9.10 employs the analogous values, using the sample mean ($\bar{X}$) and
$\mu = n\pi_1$ to represent the hypothesized population mean. The denominator of Equation 1.3
represents the standard deviation of a sampling distribution that is based on a sample size of n
for what is assumed to be a normally distributed variable. The denominator of Equation 9.10
represents the standard deviation of a sampling distribution of a binomially distributed variable
that is based on a sample size of m. In both equations the denominator can be summarized as
follows: $\sigma / \sqrt{\text{number of subjects}}$. It should also be noted that when the number
of subjects is m = 1, Equation 9.10 reduces to Equation 9.7.

Note that in Example 9.6, each of the m = 10 subjects is tested for n = 200 trials. On each
trial it is assumed that a subject has a likelihood of $\pi_1 = .5$ of being correct and a
likelihood of $\pi_2 = .5$ of being incorrect. Since we are dealing with a binomially distributed
variable, the expected number of correct responses for each subject, as well as the expected
average number of correct responses for the group of m = 10 subjects, is $\mu = n\pi_1$. As
previously noted, the standard deviation of the sampling distribution for a single subject is
defined by $\sigma = \sqrt{n \pi_1 \pi_2}$. Since there are m subjects, the standard deviation of the
sampling distribution for m subjects will be $\sigma / \sqrt{m}$, which is the denominator of
Equation 9.10.

The null and alternative hypotheses for Example 9.6 are identical to those employed for
Example 9.5. The only difference is that, whereas in Example 9.5 $H_0$ and $H_1$ are stated in
reference to the population of scores for a single subject, in Example 9.6 they are stated in
reference to the population of scores for a population of subjects that is represented by the m
subjects in the sample. In Example 9.6, the mean number of correct guesses by the 10 subjects is
computed with Equation 1.1: $\bar{X} = 1040/10 = 104$. The values n = 200, $\pi_1 = .5$,
$\pi_2 = .5$ are identical to those employed in Example 9.5. Thus, as is the case for Example 9.5,
$\mu = (200)(.5) = 100$ and $\sigma = \sqrt{(200)(.5)(.5)} = 7.07$. When the appropriate values are
substituted in Equation 9.10, the value z = 1.79 is computed.

\[ z = \frac{104 - 100}{\dfrac{7.07}{\sqrt{10}}} = 1.79 \]

The obtained value z = 1.79 is evaluated with Table A1. The nondirectional alternative
hypothesis $H_1: \pi_1 \neq .5$ is not supported, since the value z = 1.79 is less than the tabled
critical two-tailed value $z_{.05} = 1.96$. The directional alternative hypothesis
$H_1: \pi_1 > .5$ is supported at the .05 level, since the obtained value z = 1.79 is a positive
number that is greater than the tabled critical one-tailed value $z_{.05} = 1.65$. The latter
alternative hypothesis is not supported at the .01 level, since z = 1.79 is less than the tabled
critical one-tailed value $z_{.01} = 2.33$. The directional alternative hypothesis
$H_1: \pi_1 < .5$ is not supported, since to be supported, the sign of z must be negative.
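
The computation for Example 9.6 can be sketched in Python as shown below. The variable names are illustrative; the scores are the ten values listed in the example.

```python
import math

scores = [102, 104, 100, 98, 96, 80, 110, 120, 102, 128]  # correct guesses per subject
n_trials, pi1 = 200, 0.5

m = len(scores)                                        # number of subjects
mean_score = sum(scores) / m                           # sample mean of the m scores
mu = n_trials * pi1                                    # expected score under the null hypothesis
sigma = math.sqrt(n_trials * pi1 * (1 - pi1))          # binomial standard deviation

# Equation 9.10: deviation of the group mean divided by sigma / sqrt(m)
z = (mean_score - mu) / (sigma / math.sqrt(m))
print(mean_score, round(sigma, 2), round(z, 2))  # 104.0 7.07 1.79
```

With m = 1 and a single score x, the same expression reduces to the count form of the z test in Equation 9.7.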

The above analysis allows one to conclude that the group as a whole scores significantly 
above chance. The latter result can be interpreted as evidence of extrasensory perception if one 
is able to rule out alternative sensory and cognitive explanations of information transmission. 
The reader should be aware of the fact that in spite of the conclusions with regard to the group, 
it is entirely conceivable that the performance of one or more of the subjects in the group is not 
statistically significant. Inspection of the data reveals that the performance of the subject who 
obtains a score of 100 is at chance expectancy. Additionally, the scores of some of the other 
subjects (e.g., 102, 104, 96, 98) are well within chance expectancy.

It is instructive to note that in the case of Example 9.6, if for some reason one is unwilling
to assume that the variable under study is binomially distributed with a standard deviation of
$\sigma = 7.07$, the population standard deviation must be estimated from the sample data. Under
the latter conditions the single-sample t test (Test 2) is the appropriate test to employ, and the
following null hypothesis is evaluated: $H_0: \mu = 100$. If each of the 10 scores in Example 9.6
is squared and the squared scores are summed, they yield the value $\Sigma X^2 = 109728$.
Employing Equation 2.1, the estimated population standard deviation is computed to be
$\tilde{s} = 13.2$. Substituting the latter value, along with $\bar{X} = 104$, $\mu = 100$, and
n = 10 in Equation 2.3 yields the value t = .96. Since for df = 9, t = .96 falls far short of the
tabled critical two-tailed .05 value $t_{.05} = 2.26$ and the tabled critical one-tailed .05 value
$t_{.05} = 1.83$, the null hypothesis is retained. This result is the opposite of that reached when
Equation 9.10 is employed with the same data. The difference between the two tests can be
attributed to the fact that in the case of the single-sample t test the estimated population
standard deviation $\tilde{s} = 13.2$ is almost twice the value of $\sigma = 7.07$ computed for a
binomially distributed variable.
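
The t test computation can be reproduced as follows. This Python sketch uses the usual unbiased estimate of the population standard deviation (division by n - 1) and the standard single-sample t ratio; it is assumed here that these correspond to the forms the text cites as Equations 2.1 and 2.3.

```python
import math

scores = [102, 104, 100, 98, 96, 80, 110, 120, 102, 128]
mu_null = 100
n = len(scores)

sum_x = sum(scores)
sum_x_squared = sum(x * x for x in scores)

# Unbiased estimate of the population standard deviation
s_tilde = math.sqrt((sum_x_squared - sum_x ** 2 / n) / (n - 1))

# Single-sample t ratio with df = n - 1
t = (sum_x / n - mu_null) / (s_tilde / math.sqrt(n))
print(sum_x_squared, round(s_tilde, 1), round(t, 2))  # 109728 13.2 0.96
```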




2. Test 9b: The single-sample test for the median There are occasions when the binomial
sign test for a single sample is employed to evaluate a hypothesis regarding a population median.
Specifically, the test may be used to determine the likelihood of observing a specified number
of scores above versus below the median of a distribution. When the binomial sign test for a
single sample is used within this context it is often referred to as the single-sample test for the
median. This application of the binomial sign test for a single sample will be illustrated with
Example 9.7. Since Example 9.7 assumes $\pi_1 = \pi_2 = .5$ and has the same binomial coefficient
obtained for Examples 9.1 and 9.2, it yields the same result as the latter examples.


Example 9.7 Assume that the median blood cholesterol level for a healthy 30-year-old male 
is 200 mg/100 ml. Blood cholesterol readings are obtained for a group consisting of eleven 30- 
year-old men who have had a heart attack within the last month. The blood cholesterol scores 
of the eleven men are: 230, 167, 250, 345, 442, 190, 200, 248, 289, 262, 301. Can one conclude 
that the median cholesterol level of the population represented by the sample (i.e., recent male 
heart attack victims) is some value other than 200? 


Since the median identifies the 50th percentile of a distribution, if the population median
is in fact equal to 200, one would expect one-half of the sample to have a blood cholesterol
reading above 200 (i.e., $p_1 = .5$), and one-half of the sample to have a reading below 200 (i.e.,
$p_2 = .5$). Although the null hypothesis and the nondirectional alternative hypothesis for
Example 9.7 can be stated using the format $H_0: \pi_1 = .5$ versus $H_1: \pi_1 \neq .5$, they can
also be stated as follows: $H_0: \theta = 200$ versus $H_1: \theta \neq 200$. Employing the latter
format, the null hypothesis states that the median of the population the sample represents equals
200, and the alternative hypothesis states that the median of the population the sample represents
does not equal 200.

When the binomial sign test for a single sample is employed to test a hypothesis about
a population median, one must determine the number of cases that fall above versus below the
hypothesized population median. Any score that is equal to the median is eliminated from the
data. Employing this protocol, the score of the man who has a blood cholesterol of 200 is
dropped from the data, leaving 10 scores, 8 of which are above the hypothesized median value,
and 2 of which are below it. Thus, as is the case in Examples 9.1 and 9.2, we want to determine
the likelihood of obtaining 8 or more observations in one category (i.e., above the median) if
there are a total of 10 observations. It was previously determined that the latter probability is
equal to .0547. As noted earlier, this result does not support the nondirectional alternative
hypothesis $H_1: \pi_1 \neq .5$. It just falls short of supporting the directional alternative
hypothesis $H_1: \pi_1 > .5$ (which in the case of Example 9.7 can also be stated as
$H_1: \theta > 200$).
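
The protocol just described (drop scores equal to the hypothesized median, count the scores above and below it, and evaluate the count with the binomial distribution) can be sketched in Python. The function name `upper_tail_probability` is an illustrative choice, and the exact probability is computed directly rather than read from a table.

```python
import math


def upper_tail_probability(x, n, pi1=0.5):
    """Exact binomial probability of x or more observations in Category 1 out of n."""
    return sum(
        math.comb(n, k) * pi1 ** k * (1 - pi1) ** (n - k) for k in range(x, n + 1)
    )


cholesterol = [230, 167, 250, 345, 442, 190, 200, 248, 289, 262, 301]
theta = 200  # hypothesized population median

above = sum(1 for score in cholesterol if score > theta)
below = sum(1 for score in cholesterol if score < theta)
n = above + below  # scores equal to the hypothesized median are dropped

p_one_tailed = upper_tail_probability(above, n)
print(above, below, round(p_one_tailed, 4))  # 8 2 0.0547
```

The one-tailed probability is slightly above .05, and it is well above the value .025 required for the nondirectional alternative hypothesis at the .05 level.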

The data for Example 9.6 can also be evaluated within the framework of the single-sample
test for the median. Specifically, if we assume a binomially distributed variable for which
$\pi_1 = \pi_2 = .5$ and $\mu = 100$, the population median will also equal 100. In Example 9.6,
6 out of the 10 subjects score above the hypothesized median value $\theta = 100$, 3 score below
it, and one subject obtains a score of 100. After the latter score is dropped from the analysis,
9 scores remain. Thus, we want to determine the likelihood that there will be 6 or more
observations in one category (i.e., above the hypothesized median) if there are a total of 9
observations. Using Table A7, we determine that the latter probability equals .2539. Since the
value .2539 is greater than the required two-tailed .05 probability $\alpha/2 = .025$, as well as
the one-tailed probability $\alpha = .05$, the null hypothesis cannot be rejected regardless of
which alternative hypothesis is employed. This is in stark contrast to the decision reached when
the data are evaluated with Equation 9.10. Since the latter equation employs more information
(i.e., the interval/ratio scores of each subject are employed to compute the sample mean), it
provides a more powerful test of an alternative hypothesis than does the single-sample test for
the median (which conceptualizes scores as categorical data).

The Wilcoxon signed-ranks test also provides a more powerful test of an alternative
hypothesis concerning a population median than does the binomial sign test for a single
sample/single-sample test for the median. This will be demonstrated by employing the
Wilcoxon signed-ranks test to evaluate the null hypothesis $H_0: \theta = 200$ for Example 9.7.
Table 9.3 summarizes the analysis.


Table 9.3 Data for Example 9.7

Subject     X     $D = X - \theta$    Rank of |D|    Signed rank of |D|
   1       230          30                 2                  2
   2       167         -33                 3                 -3
   3       250          50                 5                  5
   4       345         145                 9                  9
   5       442         242                10                 10
   6       190         -10                 1                 -1
   7       200           0                 -                  -
   8       248          48                 4                  4
   9       289          89                 7                  7
  10       262          62                 6                  6
  11       301         101                 8                  8

$\Sigma R+ = 51$
$\Sigma R- = 4$


The computed Wilcoxon statistic is T = 4, since $\Sigma R- = 4$ (the smaller of the two values
$\Sigma R-$ versus $\Sigma R+$) is employed to represent the test statistic. The value T = 4 is
evaluated with Table A5 (Table of Critical T Values for Wilcoxon's Signed-Ranks and
Matched-Pairs Signed-Ranks Test) in the Appendix. Employing Table A5, we determine that for
n = 10 signed ranks, the tabled critical two-tailed .05 and .01 values are $T_{.05} = 8$ and
$T_{.01} = 3$, and the tabled critical one-tailed .05 and .01 values are $T_{.05} = 10$ and
$T_{.01} = 5$. Since the null hypothesis can only be rejected if the computed value T = 4 is equal
to or less than the tabled critical value at the prespecified level of significance, we can
conclude the following:

The nondirectional alternative hypothesis $H_1: \theta \neq 200$ is supported at the .05 level,
since T = 4 is less than the tabled critical two-tailed value $T_{.05} = 8$. It is not supported
at the .01 level, since T = 4 is greater than the tabled critical two-tailed value $T_{.01} = 3$.

The directional alternative hypothesis $H_1: \theta > 200$ is supported at both the .05 and .01
levels since: a) The data are consistent with the directional alternative hypothesis
$H_1: \theta > 200$. In other words, the fact that $\Sigma R+ > \Sigma R-$ is consistent with the
directional alternative hypothesis $H_1: \theta > 200$; and b) The obtained value T = 4 is less
than the tabled critical one-tailed values $T_{.05} = 10$ and $T_{.01} = 5$.

The directional alternative hypothesis $H_1: \theta < 200$ is not supported, since it is not
consistent with the data. For the latter alternative hypothesis to be supported, $\Sigma R-$ must
be greater than $\Sigma R+$.
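
The ranking procedure summarized in Table 9.3 can be sketched in Python. The function name `wilcoxon_signed_rank_sums` is an illustrative choice; tied absolute differences are given the average of the ranks they occupy, which is a common convention and is assumed here (the data for Example 9.7 contain no ties).

```python
def wilcoxon_signed_rank_sums(scores, theta):
    """Return (sum of positive ranks, sum of negative ranks, T) for H0: median = theta."""
    # Difference scores; zero differences are dropped from the analysis
    diffs = sorted((x - theta for x in scores if x != theta), key=abs)
    ranks = []
    i = 0
    while i < len(diffs):
        j = i
        # Group tied absolute differences and give each the average rank
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        ranks.extend([(i + 1 + j) / 2] * (j - i))
        i = j
    sum_positive = sum(r for d, r in zip(diffs, ranks) if d > 0)
    sum_negative = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return sum_positive, sum_negative, min(sum_positive, sum_negative)


cholesterol = [230, 167, 250, 345, 442, 190, 200, 248, 289, 262, 301]
sum_pos, sum_neg, t_statistic = wilcoxon_signed_rank_sums(cholesterol, 200)
print(sum_pos, sum_neg, t_statistic)  # 51.0 4.0 4.0
```

As a check, the two rank sums must total n(n + 1)/2, which for n = 10 signed ranks is 55.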

Thus, if Example 9.7 is evaluated with the Wilcoxon signed-ranks test the nondirectional
alternative hypothesis $H_1: \theta \neq 200$ is supported at the .05 level, and the directional
alternative hypothesis $H_1: \theta > 200$ is supported at both the .05 and .01 levels. When the
same data are evaluated with the binomial sign test for a single sample/single-sample test for the
median, none of the alternative hypotheses are supported (although the directional alternative
hypothesis $H_1: \pi_1 > .5$ falls just short of being significant at the .05 level). From the
preceding it should be apparent that the Wilcoxon signed-ranks test (which employs a greater
amount of information) provides a more powerful test of an alternative hypothesis than does the
binomial sign test for a single sample/single-sample test for the median.

Examination of Example 6.1 (which is identical to Example 2.1) allows us to contrast the
power of the binomial sign test for a single sample/single-sample test for the median with the
power of both the single-sample t test and the Wilcoxon signed-ranks test. When the latter
problem is evaluated with the single-sample t test, the null hypothesis $H_0: \mu = 5$ cannot be
rejected if a nondirectional alternative hypothesis is employed. However, the null hypothesis can
be rejected at the .05 level if the directional alternative hypothesis $H_1: \mu > 5$ is employed.
With reference to the latter alternative hypothesis, the obtained t value is greater than the
tabled critical $t_{.05}$ value by a comfortable margin. When Example 6.1 is evaluated with the
Wilcoxon signed-ranks test, the null hypothesis $H_0: \theta = 5$ cannot be rejected if a
nondirectional alternative hypothesis is employed. When the directional alternative hypothesis
$H_1: \theta > 5$ is employed, the analysis just falls short of being significant at the .05 level.
As noted in the discussion of the Wilcoxon signed-ranks test, the different conclusions derived
from the two tests illustrate the fact that when applied to the same data, the single-sample t
test provides a more powerful test of an alternative hypothesis than the Wilcoxon signed-ranks
test.

If Example 6.1 is evaluated with the binomial sign test for a single sample/single-sample
test for the median, it would be expected that it would be the least powerful of the three tests.
In Example 6.1, 7 of the 10 scores fall above the hypothesized population median $\theta = 5$ and
3 scores fall below it. Thus, using the binomial distribution we want to determine the likelihood
of obtaining 7 or more observations in one category (i.e., above the median) if there are a total
of 10 observations. Employing either Table A6 or A7, we can determine that for
$\pi_1 = \pi_2 = .5$ and n = 10, the likelihood of 7 or more observations in one category is .1719.
Since the latter value is well above the required two-tailed .05 value $\alpha/2 = .025$ and the
required one-tailed .05 value $\alpha = .05$, the directional alternative hypothesis
$H_1: \theta > 5$ is not supported. Thus, when compared with the Wilcoxon signed-ranks test, which
falls just short of significance, the binomial sign test for a single sample does not even come
close to being significant.

The above noted differences between the single-sample t test, the Wilcoxon signed-ranks
test, and the binomial sign test for a single sample illustrate that when the original data are in
an interval/ratio format, the most powerful test of an alternative hypothesis is provided by the
single-sample t test and the least powerful by the binomial sign test for a single sample. As
noted in the Introduction of the book, most researchers would not be inclined to employ a
nonparametric test with interval/ratio data unless they had reason to believe that one or more
of the assumptions of the appropriate parametric test were saliently violated. In the same respect,
unless there is reason to believe that the underlying population distribution is not symmetrical,
it is more logical to employ the Wilcoxon signed-ranks test as opposed to the binomial sign
test for a single sample to evaluate Example 6.1.


3. Computing the power of the binomial sign test for a single sample Cohen (1977, 1988)
has developed a statistic called the g index that can be employed to compute the power of the
binomial sign test for a single sample when $H_0: \pi_1 = .5$ is evaluated. The g index represents
the distance in units of proportion from the value .50. The equation Cohen (1977, 1988) employs
for the g index is $g = P - .50$, where P represents the hypothesized value of the population
proportion stated in the alternative hypothesis (in this instance it is assumed that the researcher
has stated a specific value in the alternative hypothesis as an alternative to the value that is
stipulated in the null hypothesis).

Cohen (1977; 1988, Ch. 5) has derived tables that allow a researcher, through use of the 
g index, to determine the appropriate sample size to employ if one wants to test a hypothesis 
about the distance of a proportion from the value .5 at a specified level of power. Cohen (1977; 
1988, pp. 147—150) has proposed the following (admittedly arbitrary) g values as criteria for 
identifying the magnitude of an effect size: a) A small effect size is one that is greater than .05 
but not more than .15; b) A medium effect size is one that is greater than .15 but not more than
.25; and c) A large effect size is greater than .25. 
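
The g index and the effect size criteria quoted above can be sketched in Python. The function names and the label returned for values of .05 or less are illustrative assumptions (the criteria quoted from Cohen do not name that range), and the absolute value of g is used for classification so that proportions below .50 are handled symmetrically.

```python
def g_index(p_alternative):
    """g = P - .50, the signed distance of the hypothesized proportion from .50."""
    return p_alternative - 0.50


def effect_size_label(g):
    """Classify |g| using the quoted boundaries .05, .15, and .25."""
    magnitude = round(abs(g), 10)  # rounding guards against floating-point noise at a boundary
    if magnitude > 0.25:
        return "large"
    if magnitude > 0.15:
        return "medium"
    if magnitude > 0.05:
        return "small"
    return "below the small range"  # label assumed for illustration


for p in (0.60, 0.65, 0.70, 0.80):
    print(p, effect_size_label(g_index(p)))
# 0.6 small
# 0.65 small
# 0.7 medium
# 0.8 large
```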


VII. Additional Discussion of the Binomial Sign Test for a 
Single Sample 


1. Evaluating goodness-of-fit for a binomial distribution There may be occasions when a
researcher wants to evaluate the hypothesis that a set of data is derived from a binomially
distributed population. Example 9.8 will be employed to demonstrate how the latter hypothesis can
be evaluated with the chi-square goodness-of-fit test.


Example 9.8 An animal biologist states that the probability is .25 that while in captivity a 
female of a species of Patagonian fox will give birth to an albino pup. Records are obtained on 
100 litters each comprised of six pups (which is the modal pup size for the species) from zoos 
throughout the world. In 14 of the litters there were 0 albino pups, in 30 litters there was 1
albino pup, in 35 litters there were 2 albino pups, in 18 litters there were 3 albino pups, in 2
litters there were 4 albino pups, in 1 litter there were 5 albino pups, and in 0 litters there were
6 albino pups. Is there a high likelihood the data represent a binomially distributed population
with $\pi_1 = .25$?


The null and alternative hypotheses that are evaluated with the chi-square goodness-of-fit 
test in reference to Example 9.8 can either be stated in the form they are presented in Section III 
of the latter test (i.e., $H_0$: $o_i = \varepsilon_i$ for all cells; $H_1$: $o_i \neq \varepsilon_i$ for at least one cell), or as follows.


Null hypothesis $H_0$: The sample is derived from a binomially distributed population, with
$\pi_1 = .25$.


Alternative hypothesis $H_1$: The sample is not derived from a binomially distributed
population, with $\pi_1 = .25$. This is a nondirectional alternative hypothesis.


The analysis of Example 9.8 with the chi-square goodness-of-fit test is summarized in 
Table 9.4. The latter table is comprised of k = 7 cells/categories, with each cell corresponding 
to the number of albino pups in a litter. The second column of Table 9.4 contains the observed 
frequencies for albino pups. The expected frequency for each cell was obtained by employing 
Equation 8.1. Specifically, the value 100, which represents the total number of litters/obser- 
vations, is multiplied by the appropriate binomial probability in Table A6 for a given value of 
x when n = 6 (the number of pups in a litter) and $\pi_1 = .25$. The latter binomial probabilities are
as follows: x = 0 (p = .1780); x = 1 (p = .3560); x = 2 (p = .2966); x = 3 (p = .1318);
x = 4 (p = .0330); x = 5 (p = .0044); x = 6 (p = .0002). Thus, if the data are binomially
distributed with $\pi_1 = .25$, the following probabilities are associated with the number of albino
pups in a litter comprised of 6 pups: 0 albino pups: p = .1780; 1 albino pup: p = .3560; 2 albino
pups: p = .2966; 3 albino pups: p = .1318; 4 albino pups: p = .0330; 5 albino pups: p = .0044;
and 6 albino pups: p = .0002. The expected frequencies in Column 3 of Table 9.4 are the result
of multiplying each of the aforementioned binomial probabilities by 100. To illustrate the
computation of an expected frequency, the value 17.80 is obtained for Row 1 (0 albino pups) as
follows: $E_1 = (100)(.1780) = 17.80$.

Employing Equation 8.2, the value $\chi^2 = 5.65$ is computed for Example 9.8. Since there
are k = 7 cells and w = 0 parameters that are estimated, employing Equation 8.7, the degrees of
freedom for the analysis are df = 7 - 1 - 0 = 6. Employing Table A4, we determine that for
df = 6 the tabled critical .05 and .01 values are $\chi^2_{.05} = 12.59$ and $\chi^2_{.01} = 16.81$. Since the
computed value $\chi^2 = 5.65$ is less than both of the aforementioned values, the null hypothesis
cannot be rejected. Thus, the analysis does not indicate that the data deviate significantly from
a binomial distribution.


Table 9.4 Chi-Square Summary Table for Example 9.8

Cell    O_i     E_i      (O_i - E_i)    (O_i - E_i)^2    (O_i - E_i)^2 / E_i
0       14      17.80       -3.80          14.44                .81
1       30      35.60       -5.60          31.36                .88
2       35      29.66        5.34          28.52                .96
3       18      13.18        4.82          23.23               1.76
4        2       3.30       -1.30           1.69                .51
5        1        .44         .56            .31                .71
6        0        .02        -.02          .0004                .02

Sum of O_i = 100    Sum of E_i = 100    Sum of (O_i - E_i) = 0    Chi-square = 5.65
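
The computations summarized in Table 9.4 can be reproduced with a short Python sketch. It is an illustration only: the variable and function names are hypothetical, and the binomial probabilities are computed directly rather than read from Table A6, so the unrounded expected frequencies give a chi-square value that agrees with the tabled 5.65 only to within rounding.

```python
from math import comb


def binomial_pmf(x: int, n: int, pi1: float) -> float:
    """Probability of exactly x Category 1 outcomes in n trials."""
    return comb(n, x) * pi1**x * (1 - pi1) ** (n - x)


observed = [14, 30, 35, 18, 2, 1, 0]  # litters with 0, 1, ..., 6 albino pups
n_pups, pi1 = 6, 0.25                 # litter size and stipulated proportion
n_litters = sum(observed)             # 100 litters

# Expected frequency = number of litters x binomial probability
expected = [n_litters * binomial_pmf(x, n_pups, pi1) for x in range(n_pups + 1)]

# Chi-square statistic: sum of (O - E)^2 / E over the k = 7 cells
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 0            # k - 1 - w, with w = 0 estimated parameters

print([round(e, 2) for e in expected])
print(round(chi_square, 2), df)
```

Comparing `chi_square` with the tabled critical values 12.59 and 16.81 leads to the same decision as the text: the null hypothesis is retained.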

It should be noted that if, instead of stipulating the value $\pi_1 = .25$, the population
proportion had been estimated from the sample data, the value of df is reduced by 1 (i.e.,
df = 7 - 1 - 1 = 5). The latter is the case, since an additional degree of freedom must be
subtracted for the parameter that is estimated. In Example 9.8 the proportion of albino pups is
$p_1 = n_1/N = 167/600 = .278$ (where $n_1$ is the total number of albino pups, and N is the total
number of pups in the 100 litters (N = (6)(100) = 600)). The value $n_1 = 167$ is computed as
follows: a) In each row of Table 9.4, multiply the value for the number of albino pups in Column
1 by the observed frequency for that number of albino pups in Column 2; and b) Sum the seven
products obtained in a). The latter sum will equal the total number of albino pups and, if that
value is divided by the total number of pups (N), it yields the value $p_1$. The latter value will
represent the best estimate of the population proportion $\pi_1$.

It is important to note that if the value .278 is employed to represent $\pi_1$, the expected
frequencies will be different from those recorded in Column 3 of Table 9.4. Since the value
$\pi_1 = .278$ is not listed in Table A6, the latter table cannot be employed to determine the
binomial probabilities to employ in computing the expected frequencies. Consequently we
would have to employ Equation 9.3 to compute the appropriate binomial probabilities for the
values of x (i.e., 0, 1, 2, 3, 4, 5, and 6) when n = 6, $\pi_1 = .278$ and $\pi_2 = 1 - .278 = .722$, and
then use the resulting binomial probabilities to compute the expected frequencies (once again by 
multiplying each probability by 100). 
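
A minimal Python sketch of the estimation procedure described above follows. The names are hypothetical, and one assumption differs slightly from the text: the unrounded estimate 167/600 is used in place of the rounded value .278, so the expected frequencies may differ in the last decimal place from hand computations.

```python
from math import comb

observed = [14, 30, 35, 18, 2, 1, 0]  # litters with 0, 1, ..., 6 albino pups
n_pups = 6
n_litters = sum(observed)

# Steps a) and b): multiply each albino count by its frequency, then sum
total_albino = sum(x * o for x, o in enumerate(observed))
total_pups = n_pups * n_litters
p1 = total_albino / total_pups        # sample estimate of the population proportion

# Expected frequencies from the binomial equation with the estimated proportion
expected = [
    n_litters * comb(n_pups, x) * p1**x * (1 - p1) ** (n_pups - x)
    for x in range(n_pups + 1)
]
df = len(observed) - 1 - 1            # one extra df lost for the estimated parameter

print(total_albino, total_pups, round(p1, 3), df)
print([round(e, 2) for e in expected])
```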


VIII. Additional Example Illustrating the Use of the Binomial 
Sign Test for a Single Sample 


Example 9.9 employs the binomial sign test for a single sample in a case where the value of $\pi_1$
stated in the null hypothesis is close to 1. It also illustrates that the continuity-corrected version
of the z test for a population proportion can provide an excellent approximation of the
binomial distribution, even if the value of $\pi_1$ is far removed from .5.


Example 9.9 A biologist has a theory that 90% of the people who develop a rare disease are
males and only 10% are females. Of 10 people he identifies who have the disease, 7 are males
and 3 are females. Do the data support the biologist's theory?


Since the information given indicates that we are dealing with a binomially distributed
variable with $\pi_1 = .9$ and $\pi_2 = .1$, the data can be evaluated with the binomial sign test for a
single sample. Based on the biologist's theory, the null hypothesis and the nondirectional
alternative hypothesis are as follows: $H_0$: $\pi_1 = .9$ versus $H_1$: $\pi_1 \neq .9$. The data consist of the
number of observations in the two categories males versus females. The respective proportions
of observations in the two categories are: $p_1 = 7/10 = .7$ and $p_2 = 3/10 = .3$. Thus, given
that $\pi_1 = .9$, we want to determine the likelihood of 7 or fewer observations in one category if
there are a total of 10 observations. Note that since $p_1 = .7$ is less than $\pi_1 = .9$, a value that is
more extreme than x = 7 will be any value that is less than 7.

Since $\pi = \pi_1 = .9$ is not listed in either Table A6 or A7, we employ for $\pi$ the
probability value listed for $\pi_2$, which in the case of Example 9.9 is $\pi_2 = .1$. The probability of x
being equal to or less than 7 (i.e., the number of observations in Category 1) if $\pi_1 = .9$, will be
equivalent to the probability of x being equal to or greater than 3 (which is the number of
observations in Category 2) if $\pi_2 = .1$. From Table A7 it can be determined that the latter
probability (which is in the cell that is the intersection of the row x = 3 and the column $\pi = .1$)
is equal to .0702. The same value can be obtained from Table A6 by adding the probabilities
for all values of x equal to or greater than 3 (for $\pi = .1$). Thus, the probability of 7 or fewer males
if there is a total of 10 observations is .0702.¹⁰ Since the latter value is greater than the
two-tailed .05 value $\alpha/2 = .025$ and the one-tailed .05 value $\alpha = .05$, neither the nondirectional
alternative hypothesis $H_1$: $\pi_1 \neq .9$ nor the directional alternative hypothesis which is consistent
with the data ($H_1$: $\pi_1 < .9$) is supported. In other words, $p_1 = .7$ (the observed proportion
of males) is not significantly below the hypothesized value $\pi_1 = .9$. In the same respect $p_2 = .3$,
the observed proportion of females, is not significantly above the expected value of $\pi_2 = .1$.
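
The table lookup described above can be checked directly. The sketch below is illustrative (the names are hypothetical); it computes the tail probability both ways, as 7 or fewer males when $\pi_1 = .9$ and as 3 or more females when $\pi_2 = .1$, to show that the two are the same quantity.

```python
from math import comb


def binomial_pmf(x: int, n: int, pi: float) -> float:
    """Probability of exactly x outcomes in a category with probability pi."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


n, pi1 = 10, 0.9

# P(7 or fewer males | pi1 = .9)
p_lower = sum(binomial_pmf(x, n, pi1) for x in range(0, 8))

# P(3 or more females | pi2 = .1), the form read from Table A7
p_mirror = sum(binomial_pmf(x, n, 1 - pi1) for x in range(3, n + 1))

print(round(p_lower, 4), round(p_mirror, 4))
print(p_lower > 0.025, p_lower > 0.05)  # exceeds both the two-tailed and one-tailed cutoffs
```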

When the z test for a population proportion is employed to evaluate Example 9.9,
Equation 9.6 (which does not employ the correction for continuity) yields the following result:
$z = (.7 - .9)/\sqrt{(.9)(.1)/10} = -2.11$. Equation 9.8 (the continuity-corrected equation) has the
same denominator as Equation 9.6, but the absolute value of the numerator is reduced by
1/[(2)(10)] = .05, thus yielding the value z = -1.58. Since the absolute value z = 2.11 is greater than the tabled critical
two-tailed .05 value $z_{.05} = 1.96$ and the tabled critical one-tailed .05 value $z_{.05} = 1.65$, without
the correction for continuity both the nondirectional alternative hypothesis $H_1$: $\pi_1 \neq .9$ and the
directional alternative hypothesis $H_1$: $\pi_1 < .9$ are supported at the .05 level. When the
correction for continuity is employed, the obtained absolute value z = 1.58 is less than both of
the aforementioned tabled critical values and, because of this, regardless of which alternative
hypothesis is employed, the null hypothesis cannot be rejected. The latter conclusion is
consistent with the result obtained when the exact binomial probabilities are employed. Thus,
even in a case where the value of $\pi_1$ is far removed from .5, the continuity-corrected equation
for the z test for a population proportion appears to provide an excellent estimate of the exact
binomial probability.
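
The two z values can be reproduced as follows. This is a sketch under stated assumptions: the variable names are hypothetical, and the corrected statistic is written in the form implied by the description above (the absolute numerator reduced by 1/(2n), with the original sign restored), which may not match the layout of Equation 9.8 symbol for symbol.

```python
from math import copysign, sqrt

n, pi1, p1 = 10, 0.9, 0.7

# Standard error of a proportion under the null hypothesis
se = sqrt(pi1 * (1 - pi1) / n)

# Uncorrected z (as in Equation 9.6)
z_uncorrected = (p1 - pi1) / se

# Continuity-corrected z: shrink the absolute numerator by 1/(2n), keep the sign
correction = 1 / (2 * n)
z_corrected = copysign(abs(p1 - pi1) - correction, p1 - pi1) / se

print(round(z_uncorrected, 2), round(z_corrected, 2))
print(abs(z_uncorrected) > 1.96, abs(z_corrected) > 1.65)
```

Only the uncorrected value exceeds the tabled critical values, which is why the corrected test agrees with the exact binomial result and the uncorrected test does not.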


IX. Addendum 


Discussion of additional discrete probability distributions In this section a number of other 
discrete probability distributions are described, some of which are related to the binomial distri- 
bution. The following distributions will be discussed: a) The multinomial distribution; b) The 
negative binomial distribution; c) The hypergeometric distribution; d) The Poisson
distribution; and e) The matching distribution.


1. The multinomial distribution Earlier in this chapter it was noted that: a) The binomial dis- 
tribution is a special case of the multinomial distribution; b) The multinomial distribution is
an extension of the binomial model to two or more categories; and c) The multinomial distri-
bution is the exact probability distribution that the chi-square goodness-of-fit test is employed 
to estimate (for two or more categories). 

The conditions that describe the model for the multinomial distribution are that on each 
of n independent trials there are k possible outcomes. The probability on any trial that an 
outcome will fall in the $i$th category is $\pi_i$. Thus, the probabilities for the k categories are
$\pi_1, \pi_2, \pi_3, \ldots, \pi_k$. As is the case with the binomial distribution, sampling with replacement
(which is defined in Endnote 1) is assumed for the multinomial distribution. 

Equation 9.11 is a multinomial generalization of Equation 9.3 (the binomial equation for 
computing the probability for a specific value of x) that can be employed when there are two or 
more categories. Equation 9.11 computes the probability that in n independent trials, $n_1$
outcomes will fall in Category 1, $n_2$ outcomes will fall in Category 2, $n_3$ outcomes will fall in
Category 3, ..., and $n_k$ outcomes will fall in Category k. The term $n!/(n_1!\,n_2!\,n_3! \cdots n_k!)$ on the
right side of Equation 9.11 is referred to as the multinomial coefficient. The multinomial analog
of the binomial expansion $(\pi_1 + \pi_2)^n$ (discussed in Endnote 6) is the multinomial expansion
described by the general equation $(\pi_1 + \pi_2 + \pi_3 + \cdots + \pi_k)^n$.


$$P(n_1, n_2, n_3, \ldots, n_k) = \frac{n!}{n_1!\,n_2!\,n_3! \cdots n_k!}\,(\pi_1)^{n_1}(\pi_2)^{n_2}(\pi_3)^{n_3} \cdots (\pi_k)^{n_k}$$   (Equation 9.11)


When k = 2, Equation 9.11 reduces to Equation 9.3. Thus, using Equation 9.11 we can
compute the binomial probability .0439 for x = 8 computed for Examples 9.1/9.2 in Section IV.


$$P(n_1 = 8, n_2 = 2) = \frac{n!}{n_1!\,n_2!}\,(\pi_1)^{n_1}(\pi_2)^{n_2} = \frac{10!}{8!\,2!}\,(.5)^8(.5)^2 = .0439$$

Examples 9.10, 9.11, and 9.12 will be employed to illustrate the application of Equation 
9.11 to compute a multinomial probability. 


Example 9.10 An automobile dealer gets a delivery of ten cars. The company that manu- 
factures the cars only delivers cars of the following three colors: silver, red, and blue. Assume 
that the likelihood of a car being silver, red, or blue is as follows: Silver (.2), Red (.3), and Blue 
(.5). What is the probability that of the ten cars delivered, five will be silver, four will be red, and 
one will be blue? 


In Example 9.10 there are n = 10 observations/trials which correspond to the total of ten
cars, and there are the following three categories: Category 1 - Silver cars; Category 2 - Red
cars; Category 3 - Blue cars. Thus, $\pi_1 = .2$, $\pi_2 = .3$, and $\pi_3 = .5$. Since we are asking what
the probability is that there will be five silver cars, four red cars, and one blue car, we can
stipulate the values $n_1 = 5$, $n_2 = 4$, and $n_3 = 1$. Substituting the appropriate values in
Equation 9.11, we determine that the probability of the delivery being comprised of five silver
cars, four red cars, and one blue car is .0016.


$$P(n_1 = 5, n_2 = 4, n_3 = 1) = \frac{10!}{5!\,4!\,1!}\,(.2)^5(.3)^4(.5)^1 = .0016$$


Example 9.11 Assume that during an official time at bat a major league baseball player has
a .65 chance of making out, a .18 chance of hitting a single, a .07 chance of hitting a double, a 
.01 chance of hitting a triple, and a .09 chance of hitting a home run. If, during a game, the 
player has six official at bats, what is the likelihood that he will make out in all six at bats? 


In Example 9.11 there are n = 6 observations/trials which correspond to the six at bats, and
there are the following five categories: Category 1 - Out; Category 2 - Single; Category 3 -
Double; Category 4 - Triple; Category 5 - Home run. Thus, $\pi_1 = .65$, $\pi_2 = .18$, $\pi_3 = .07$,
$\pi_4 = .01$, $\pi_5 = .09$. Since we are asking what the probability is that there will be six outs, zero
singles, zero doubles, zero triples, and zero home runs, we can stipulate the values $n_1 = 6$,
$n_2 = 0$, $n_3 = 0$, $n_4 = 0$, and $n_5 = 0$. Substituting the appropriate values in Equation 9.11, we
determine that the probability of the player making out six times is .075.


$$P(n_1 = 6, n_2 = 0, n_3 = 0, n_4 = 0, n_5 = 0) = \frac{6!}{6!\,0!\,0!\,0!\,0!}\,(.65)^6(.18)^0(.07)^0(.01)^0(.09)^0 = .075$$


Example 9.12 A bird watcher spends the day searching for a particular species of bird whose 
beak can be either red or white and whose tail can be either long or short. Assume that the 
likelihood of a bird having a red beak and long tail is .10, the likelihood of a bird having a red 
beak and a short tail is .30, the likelihood of a bird having a white beak and long tail is .40, and
the likelihood of a bird having a white beak and short tail is .20. If the bird watcher spots three 
birds of the species in question, what is the probability of observing three birds that conform to 
the characteristics noted in Table 9.5? 


Table 9.5 Data for Example 9.12


                             Tail Size
                      Long Tail    Short Tail
Beak Color   Red          0            1
             White        1            1


In Example 9.12 there are n = 3 observations/trials that correspond to the total of three
birds, and there are the following four categories: Category 1 - Red beak/Long tail; Category
2 - Red beak/Short tail; Category 3 - White beak/Long tail; Category 4 - White beak/Short tail.
Thus, $\pi_1 = .1$, $\pi_2 = .3$, $\pi_3 = .4$, and $\pi_4 = .2$. Since we are asking what the probability is
that there will be zero birds with a Red beak/Long tail, one bird with a Red beak/Short tail, one
bird with a White beak/Long tail, and one bird with a White beak/Short tail, we can stipulate the
values $n_1 = 0$, $n_2 = 1$, $n_3 = 1$, and $n_4 = 1$. Substituting the appropriate values in Equation
9.11 we determine that the probability of the bird watcher sighting the three birds noted in Table
9.5 is .144.


$$P(n_1 = 0, n_2 = 1, n_3 = 1, n_4 = 1) = \frac{3!}{0!\,1!\,1!\,1!}\,(.1)^0(.3)^1(.4)^1(.2)^1 = .144$$
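
The multinomial computations in Examples 9.10 through 9.12 all follow the same pattern, which can be sketched as a single Python function. The function name and argument layout are hypothetical; the arithmetic is the multinomial coefficient multiplied by the product of the category probabilities raised to their counts.

```python
from math import factorial, prod


def multinomial_probability(counts, probs):
    """Probability of observing exactly `counts` outcomes across the categories."""
    n = sum(counts)
    coefficient = factorial(n)
    for c in counts:
        coefficient //= factorial(c)  # builds n! / (n1! n2! ... nk!)
    return coefficient * prod(p**c for p, c in zip(probs, counts))


cars = multinomial_probability([5, 4, 1], [0.2, 0.3, 0.5])
at_bats = multinomial_probability([6, 0, 0, 0, 0], [0.65, 0.18, 0.07, 0.01, 0.09])
birds = multinomial_probability([0, 1, 1, 1], [0.1, 0.3, 0.4, 0.2])
coin = multinomial_probability([8, 2], [0.5, 0.5])  # k = 2 reduces to the binomial case

print(round(cars, 4), round(at_bats, 3), round(birds, 3), round(coin, 4))
```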


In point of fact, if the bird watcher spots three birds, there are 20 possible configurations 
of beak color and tail length (which we assume are independent of one another) for which we can 
compute a multinomial probability. The 20 configurations are summarized in Table 9.6 along 
with their probabilities. Note that the probability value associated with each of the configurations
is quite low, and that the sum of the probability values for all of the configurations adds up to 1.
If analogous tables were constructed for Examples 9.10 and 9.11 (in which there are, 
respectively, ten and six observations), the number of possible configurations would be 
substantially larger than the 20 configurations for Example 9.12 (which has only three 
observations). 

The values recorded in Columns 2-5 of Table 9.6 represent the number of birds who 
possess the beak and tail characteristic noted at the top of a column. The configuration 
represented in Table 9.5, for which the probability .144 is computed, corresponds to 
Configuration 19 in Table 9.6. 


Table 9.6 Color/Tail Configuration for n = 3 Birds 


Configuration   Red Beak/   Red Beak/    White Beak/   White Beak/   Multinomial
                Long Tail   Short Tail   Long Tail     Short Tail    Probability

      1             3           0            0             0            .001
      2             0           3            0             0            .027
      3             0           0            3             0            .064
      4             0           0            0             3            .008
      5             2           1            0             0            .009
      6             0           0            2             1            .096
      7             2           0            1             0            .012
      8             0           2            0             1            .054
      9             1           2            0             0            .027
     10             0           0            1             2            .048
     11             1           0            2             0            .048
     12             0           1            0             2            .036
     13             0           2            1             0            .108
     14             2           0            0             1            .006
     15             0           1            2             0            .144
     16             1           0            0             2            .012
     17             1           1            1             0            .072
     18             1           1            0             1            .036
     19             0           1            1             1            .144
     20             1           0            1             1            .048
                                                                  Sum = 1.000
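
Table 9.6 can be regenerated by enumerating every way of distributing three birds over the four categories. The sketch below is illustrative; the names are hypothetical, and the configurations come out in a different order from the numbering used in the table.

```python
from itertools import product
from math import factorial, prod

probs = (0.1, 0.3, 0.4, 0.2)  # Red/Long, Red/Short, White/Long, White/Short
n_birds = 3


def multinomial_probability(counts, probs):
    """Multinomial probability of one configuration of category counts."""
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)
    return coefficient * prod(p**c for p, c in zip(probs, counts))


# Every tuple of four non-negative counts that sums to three birds
configurations = [
    c for c in product(range(n_birds + 1), repeat=len(probs)) if sum(c) == n_birds
]
table = {c: multinomial_probability(c, probs) for c in configurations}

for counts, p in table.items():
    print(counts, round(p, 3))
print(len(table), round(sum(table.values()), 3))
```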


2. The negative binomial distribution The negative binomial distribution is another discrete 
probability distribution that can be employed within the context of evaluating certain 
experimental situations. As is the case with the binomial distribution, the model for the negative 
binomial distribution assumes the following: a) In a set of n independent trials there are only two
possible outcomes; and b) Sampling with replacement. If we identify the two outcomes as
Category 1 and Category 2, the negative binomial distribution allows us to determine the
probability that exactly n trials will be required to obtain x observations in Category 1. If $\pi_1$
represents the likelihood that an observation will fall in Category 1, and $\pi_2$ represents the
likelihood that an observation will fall in Category 2, the probability that exactly n trials will be
required to obtain x observations in Category 1 is computed with Equation 9.12.


$$P(x) = \binom{n-1}{x-1}(\pi_1)^{x}(\pi_2)^{n-x}$$   (Equation 9.12)


Although there are special tables prepared by Williamson and Bretherton (1963) for obtain- 
ing negative binomial probabilities, the latter values can also be determined through use of tables 
for the binomial distribution. Miller and Miller (1999) note that the probabilities in Table A6 can
be employed to determine a negative binomial probability computed with Equation 9.12, by
multiplying the individual probability associated with x in Table A6 by (x/n). Guenther (1968)
notes that the negative binomial probability that n or fewer trials will be required to obtain x
observations in Category 1 is equivalent to the cumulative binomial probability (in Table A7)
for x or more observations in n trials. Thus, in Table A7, for the appropriate value of $\pi_1$, one
would find the cumulative probability associated with x for a given value of n.

Examples 9.13 and 9.14 will be employed to illustrate the application of Equation 9.12 to 
compute a negative binomial probability. 


Example 9.13 The likelihood on any trial that a copy machine will print an acceptable copy 
is only .25. What is the probability that exactly 12 copies will have to be printed before five 
acceptable copies are printed by the machine? What is the probability that the fifth acceptable 
copy will be printed on or before the twelfth trial? What is the probability that the fifth 
acceptable copy will be printed after the twelfth trial? 


Based on the information that has been provided in Example 9.13, we can stipulate the 
following values that we will employ in Equation 9.12: n = 12, x = 5, $\pi_1 = .25$, and $\pi_2 = .75$.
Substituting the appropriate values in Equation 9.12, we compute the probability p = .04317.
The latter value can also be obtained from Table A6 by doing the following: a) Go to the section
for n = 12; b) Find the probability in the cell that is the intersection of the row x = 5 and the
column $\pi = .25$ (i.e., obtain the probability of five observations in 12 trials); the latter value
is .1032; and c) Multiply the probability obtained in b) (i.e., .1032) by (x/n). The resulting value
will represent the likelihood of requiring exactly 12 trials to print five acceptable copies. Thus,
(.1032)(5/12) = .043.


$$P(x = 5) = \binom{12-1}{5-1}(.25)^5(.75)^{12-5} = \binom{11}{4}(.25)^5(.75)^7 = .04317$$
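
Both routes to the negative binomial probability can be sketched in Python. The function name is hypothetical. Carried at full precision, the computation gives approximately .0430, which matches the table-based product (.1032)(5/12) = .043; the small difference from the printed .04317 appears to come from rounding an intermediate term.

```python
from math import comb


def negative_binomial_pmf(n: int, x: int, pi1: float) -> float:
    """Probability that exactly n trials are needed to obtain x Category 1 outcomes."""
    return comb(n - 1, x - 1) * pi1**x * (1 - pi1) ** (n - x)


# Direct use of the negative binomial equation for n = 12, x = 5, pi1 = .25
p_exact = negative_binomial_pmf(12, 5, 0.25)

# Table-based route: binomial probability of 5 in 12 trials, multiplied by x/n
p_binomial = comb(12, 5) * 0.25**5 * 0.75**7
p_via_binomial = p_binomial * (5 / 12)

print(round(p_exact, 4), round(p_binomial, 4), round(p_via_binomial, 4))
```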


Example 9.13 also asks for the probability that the fifth acceptable copy will be printed on 
or before the twelfth trial, and the probability that the fifth acceptable copy will be printed after 
the twelfth trial. As noted earlier, the probability that the xth acceptable copy will be printed on
or before the nth trial will correspond to the cumulative binomial probability for n = 12, x = 5,
and $\pi_1 = .25$ in Table A7 (which contains the probabilities computed with Equation 9.5). In
the latter table the appropriate cumulative probability is .1576, which, in the case of Example
9.13, represents the likelihood that the fifth acceptable copy will be printed on or before the
twelfth trial. We can also compute the probability that the fifth acceptable copy will be printed
on or before the twelfth trial by employing Equation 9.12 with all values of n between five
(which is the fewest possible trials in which five acceptable copies can be printed) and twelve,
and summing the individual probabilities. When the values five through twelve are substituted
for n in Equation 9.12, the following probability values are obtained which sum to .1576 (the
minimal difference between the sum of the listed values and .1576 is due to rounding off error):
n = 5 (p = .00098); n = 6 (p = .003675); n = 7 (p = .00827); n = 8 (p = .01447); n = 9
(p = .02171); n = 10 (p = .02930); n = 11 (p = .03663); n = 12 (p = .04317).

The answer to the last question posed in Example 9.13 (the likelihood that the fifth 
acceptable copy will be printed after the twelfth trial) is obtained simply by subtracting the value 
.1576 from 1. Thus, 1 — .1576 = .8424 is the likelihood the fifth acceptable copy will be printed 
after the twelfth trial. 
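
The cumulative relationship described above can be verified numerically. In this sketch (names hypothetical) the negative binomial probabilities for n = 5 through 12 are summed and compared with the binomial probability of five or more acceptable copies in 12 trials; the complement gives the probability that more than twelve trials are needed.

```python
from math import comb


def negative_binomial_pmf(n: int, x: int, pi1: float) -> float:
    """Probability that the x-th Category 1 outcome occurs on trial n."""
    return comb(n - 1, x - 1) * pi1**x * (1 - pi1) ** (n - x)


x, pi1, max_trials = 5, 0.25, 12

# Fifth acceptable copy on or before trial 12: sum over n = 5, ..., 12
p_by_trial_12 = sum(negative_binomial_pmf(n, x, pi1) for n in range(x, max_trials + 1))

# Equivalent cumulative binomial probability: 5 or more successes in 12 trials
p_binomial_tail = sum(
    comb(max_trials, k) * pi1**k * (1 - pi1) ** (max_trials - k)
    for k in range(x, max_trials + 1)
)

# Fifth acceptable copy after trial 12
p_after_trial_12 = 1 - p_by_trial_12

print(round(p_by_trial_12, 4), round(p_binomial_tail, 4), round(p_after_trial_12, 4))
```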

Equation 9.13 can be employed to compute the expected value ($\mu$) for a negative
binomially distributed variable. $\mu$ represents the expected number of trials to obtain x outcomes in
Category 1. The standard deviation of a negative binomially distributed variable is computed
with Equation 9.14. In the case of Example 9.13, the values $\mu = 20$ and $\sigma = 7.75$ are computed.
Thus, if we conducted an infinite number of experiments with the copier, and in each experiment
we printed copies until five were acceptable, the average value for n (i.e., the average number
of trials required to obtain five acceptable copies) will be 20, and the standard deviation of the
sampling distribution will equal 7.75.


$$\mu = \frac{x}{\pi_1} = \frac{5}{.25} = 20$$   (Equation 9.13)

$$\sigma = \frac{\sqrt{x(1 - \pi_1)}}{\pi_1} = \frac{\sqrt{(5)(1 - .25)}}{.25} = 7.75$$   (Equation 9.14)
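
The expected value and standard deviation for Example 9.13 can be confirmed with a few lines of Python. The variable names are hypothetical, and the standard deviation is written in the algebraic form $\sqrt{x(1 - \pi_1)}/\pi_1$, which is assumed to be equivalent to the printed Equation 9.14.

```python
from math import sqrt

x, pi1 = 5, 0.25  # required acceptable copies and probability of an acceptable copy

mu = x / pi1                        # expected number of trials
sigma = sqrt(x * (1 - pi1)) / pi1   # standard deviation of the number of trials

print(mu, round(sigma, 2))
```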


Example 9.14 The likelihood that a basketball player will put the ball in the basket each time
he shoots is .25. What is the probability that the player will have to take exactly 12 shots before 
making five baskets? What is the probability that the fifth successful shot will be made on or 
before the twelfth shot? What is the probability that the fifth successful shot will be made after 
the twelfth shot? 


Since the data for Example 9.14 are identical to those employed for Example 9.13, they yield
the same probabilities. Thus, the probability of requiring exactly 12 shots to make five baskets
is .043. The probability that the fifth successful shot will occur on or before the twelfth shot is 
.1576. The probability that the fifth successful shot will occur after the twelfth shot is .8424. 


3. The hypergeometric distribution The model for the hypergeometric distribution is 
similar to the model for the binomial distribution except for one critical difference:
in the hypergeometric model sampling without replacement (which is defined in
Endnote 1) is assumed. In the hypergeometric model, in a set of n trials there are two possible 
outcomes (to be designated Category 1 versus Category 2), and because sampling without 
replacement is assumed, the outcome on each trial will be dependent on the outcomes of previous 
trials. The latter will be the case since, if there are two categories, the probability of obtaining 
an outcome in a given category will change from trial to trial and, on any trial, the value of the 
probabilities will be a function of the number of potential observations in each category that are 
still available to be selected. 

Equation 9.15 is employed to compute a hypergeometric probability. The equation assumes 
there is a population comprised of N objects, each of which falls into one of two categories. In 
the population there are $N_1$ objects in Category 1, and $N_2$ objects in Category 2. Let us assume
that we want to select a sample of n objects from the population employing sampling without
replacement. We select x objects from Category 1, and (n - x) objects from Category 2.
Equation 9.15 allows us to compute the probability that we will select exactly x objects from
Category 1 and (n - x) objects from Category 2.


$$P(x) = \frac{\binom{N_1}{x}\binom{N_2}{n-x}}{\binom{N}{n}}$$   (Equation 9.15)

Examples 9.15 and 9.16 will be employed to illustrate the computation of a hypergeometric 
probability with Equation 9.15. 


© 2000 by Chapman & Hall/CRC 


Example 9.15 What is the probability of selecting two boys and one girl from a class of nine 
students that consists of five boys and four girls? 


Example 9.16 A researcher predicts that people who suffer from migraine headaches who take 
500 milligrams of vitamin E daily are more likely to show a remission of symptoms than patients 
who don't take the vitamin. Nine patients with a history of migraines participate in a study. Five 
of the patients take 500 milligrams of vitamin E daily for a period of six months, while the other 
four patients, who comprise a control group, do not take a vitamin E supplement. At the con- 
clusion of the study, two patients in the vitamin E group exhibit a remission of symptoms, while 
one person in the control group exhibits a remission of symptoms. What is the probability of this 
outcome? 


In both of the above examples we are dealing with a population that is comprised of N = 9
students/patients. There are $N_1 = 5$ students/patients in Category 1, and $N_2 = 4$
students/patients in Category 2. We let x = 2 represent the two boys/patients in the vitamin E
group who exhibit symptom remission. Thus, n - x = 3 - 2 = 1 represents the one girl/patient
in the control group who exhibits symptom remission. Employing Equation 9.15, the value .4762
is computed below for the probability of selecting two boys and one girl, or two patients in the
experimental group exhibiting remission and one patient in the control group exhibiting remission.


$$P(x = 2) = \frac{\binom{5}{2}\binom{4}{1}}{\binom{9}{3}} = .4762$$
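
A hypergeometric probability of this kind is a ratio of counts of combinations, which can be sketched directly in Python. The function name and argument order are hypothetical.

```python
from math import comb


def hypergeometric_pmf(x: int, n: int, n1: int, n2: int) -> float:
    """Probability of drawing x objects from Category 1 (size n1) and n - x from
    Category 2 (size n2) when n objects are sampled without replacement."""
    return comb(n1, x) * comb(n2, n - x) / comb(n1 + n2, n)


# Examples 9.15/9.16: N1 = 5, N2 = 4, sample of n = 3, x = 2 from Category 1
p_two = hypergeometric_pmf(2, 3, 5, 4)
print(round(p_two, 4))
```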


In the case of Example 9.16, we want to determine whether there is a significant difference 
between the response of the vitamin E group and the control group. The null hypothesis is that 
there is no difference between the two groups. In order to evaluate the null hypothesis, we must 
compute the chance likelihood/probability of obtaining an outcome equal to or more extreme 
than the outcome observed in the study. In order to determine the latter, we must compute the 
hypergeometric probabilities for all possible outcomes involving the value n = 3. Specifically,
the following four outcomes can be obtained in which three out of a total of nine patients exhibit
a remission of symptoms: a) All three patients who exhibit remission are in the vitamin E group;
b) Two of the three patients who exhibit remission are in the vitamin E group, and the remaining
patient is in the control group; c) Two of the three patients who exhibit remission are in the
control group, and the remaining patient is in the vitamin E group; and d) All three patients who
exhibit remission are in the control group. Note that Outcome b noted above corresponds to the
observed outcome in Example 9.16, and that the sum of the probabilities for the four outcomes
is equal to 1 (due to rounding off error, the four probabilities sum to .9999). The hypergeometric
probabilities for all four possible outcomes when n = 3 are noted below.


a) $$P(x = 3) = \frac{\binom{5}{3}\binom{4}{0}}{\binom{9}{3}} = .1190$$

b) $$P(x = 2) = \frac{\binom{5}{2}\binom{4}{1}}{\binom{9}{3}} = .4762$$

c) $$P(x = 1) = \frac{\binom{5}{1}\binom{4}{2}}{\binom{9}{3}} = .3571$$

d) $$P(x = 0) = \frac{\binom{5}{0}\binom{4}{3}}{\binom{9}{3}} = .0476$$


If a directional alternative hypothesis is employed (i.e., a one-tailed analysis is conducted),
the null hypothesis can be rejected if the following conditions are met: a) The obtained 
difference is in the predicted direction; and b) The probability of obtaining a value equal to or 
more extreme than x = 2 is equal to or less than the value of $\alpha$ (which we will assume is .05). The
data are consistent with the directional alternative hypothesis which predicts a better response
by the vitamin E group. However, Outcome a is in the same direction and more extreme than
Outcome b. Thus, if we add the probabilities for Outcomes a and b, we obtain .1190 + .4762 =
.5952. Since the latter value is greater than $\alpha = .05$, the null hypothesis cannot be rejected. Thus,
we cannot conclude that vitamin E had a therapeutic effect. 
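
The one-tailed evaluation can be sketched by tabulating the probability of every possible outcome and summing those at least as extreme as the observed one in the predicted direction. The names below are hypothetical, and $\alpha = .05$ is the value assumed in the text.

```python
from math import comb

N1, N2, n = 5, 4, 3   # vitamin E group, control group, patients in remission
alpha = 0.05


def hypergeometric_pmf(x: int) -> float:
    """Probability that x of the n remissions fall in the vitamin E group."""
    return comb(N1, x) * comb(N2, n - x) / comb(N1 + N2, n)


pmf = {x: hypergeometric_pmf(x) for x in range(0, n + 1)}

# One-tailed probability: observed outcome (x = 2) plus the more extreme x = 3
p_one_tailed = pmf[2] + pmf[3]

for x in sorted(pmf, reverse=True):
    print(x, round(pmf[x], 4))
print(round(p_one_tailed, 4), p_one_tailed <= alpha)
```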

If a nondirectional alternative hypothesis is employed (i.e., a two-tailed analysis is con- 
ducted), the null hypothesis can be rejected if the probability of obtaining a value equal to or 
more extreme than x = 2 is equal to or less than the prespecified level of significance. In the case 
of a two-tailed analysis, however, we take into account more extreme outcomes in either 
direction. In actuality, all of the outcomes are more extreme than Outcome b. The latter is the 
case, since in Outcomes a, c, and d, the proportion of subjects in the group that exhibits 
remission for two or more subjects is higher than the proportion 2/5 = .40 for the vitamin E group 
in Outcome b. Because all of the other outcomes are more extreme than the observed outcome, 
the null hypothesis cannot be rejected. Use of the hypergeometric distribution in hypothesis 
testing is discussed in greater detail within the framework of the Fisher exact test (Test 16c) 
(which is discussed in Section VI of the chi-square test for r x c tables (Test 16)). 

Equation 9.16 can be employed to compute the expected value ($\mu$) of a hypergeometrically
distributed random variable ($\mu$ is generally only computed for Category 1). In the case of
Examples 9.15/9.16, the latter value is computed to be $\mu = 1.67$. The value $\mu = 1.67$ represents
the expected number of outcomes in Category 1, when n = 3. If $N_2 = 4$ is employed in Equation
9.16 in place of $N_1$, the value $\mu = 1.33$ is computed, which represents the expected number of
outcomes in Category 2, when n = 3. Note that when $\mu = 1.67$ is subtracted from 3, the resulting
value is $\mu = 1.33$.


$$\mu = \frac{nN_1}{N} = \frac{(3)(5)}{9} = 1.67$$   (Equation 9.16)


Equation 9.17 can be employed to compute the standard deviation (σ) 
of a hypergeometrically distributed random variable (σ is generally only computed for Category 
1). In the case of Examples 9.15/9.16, the latter value is computed to be σ = .745. The identical 
value will be obtained for σ if N₂ is employed in Equation 9.17 in place of N₁. 


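\[ \sigma = \sqrt{n\left(\frac{N_1}{N}\right)\left(1 - \frac{N_1}{N}\right)\left(\frac{N - n}{N - 1}\right)} = \sqrt{(3)\left(\frac{5}{9}\right)\left(1 - \frac{5}{9}\right)\left(\frac{9 - 3}{9 - 1}\right)} = .745 \]    (Equation 9.17)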
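
The computations for Equations 9.16 and 9.17 can be sketched in a few lines of Python. The function name `hypergeom_mean_sd` and its argument names are illustrative only, and the values N = 9, N₁ = 5, and n = 3 are assumed here because they reproduce the results reported for Examples 9.15/9.16.

```python
import math

def hypergeom_mean_sd(N, N1, n):
    """Mean and standard deviation of the Category 1 count (Equations 9.16 and 9.17)."""
    p1 = N1 / N                                 # proportion of cases in Category 1
    mean = n * p1                               # Equation 9.16
    correction = (N - n) / (N - 1)              # the element (N - n)/(N - 1)
    sd = math.sqrt(n * p1 * (1 - p1) * correction)   # Equation 9.17
    return mean, sd

# Assumed values: N = 9 cases, 5 in Category 1, 4 in Category 2, n = 3 drawn.
mean1, sd1 = hypergeom_mean_sd(N=9, N1=5, n=3)
mean2, sd2 = hypergeom_mean_sd(N=9, N1=4, n=3)
print(round(mean1, 2), round(sd1, 3))
print(round(mean2, 2), round(sd2, 3))
```

Under these assumptions the two means sum to n = 3, and the standard deviation is the same whichever category is treated as Category 1.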


When the value of N is very large relative to the value of n, the binomial distribution 
provides an excellent approximation of the hypergeometric distribution. This is the case, since, 
under the latter conditions, the differences between the sampling without replacement model and 
sampling with replacement model are minimized. To further clarify the relationship between the 
binomial and hypergeometric distributions, in Equation 9.16 the element N₁/N represents the 
proportion of cases in Category 1 in a dichotomous distribution. Thus, if we represent the 
element N₁/N with the notation π₁ (since it represents the same thing π₁ is employed to 
represent for the binomial distribution), an alternative way of writing Equation 9.16 is μ = nπ₁, 
which is the same as Equation 9.1, the equation for computing the expected value of a binomially 
distributed variable. 

Now let us turn our attention to Equation 9.17. Once again we can employ π₁ to represent 
the element N₁/N. The element 1 − (N₁/N) represents the proportion of cases in Category 2, 
and thus we can represent it with the notation π₂ (because π₂ = 1 − π₁), since it represents the 
same thing π₂ represents for the binomial distribution. If the value of n is very small relative 
to the value of N, the element (N − n)/(N − 1) will approach the value 1. If the latter is true, 
Equation 9.17 can be written as σ = √(nπ₁π₂), which is identical to Equation 9.2, the equation for 
computing the standard deviation of a binomially distributed variable. 

Daniel and Terrell (1995) state that as a general rule, in order to get a reasonable binomial 
approximation for a hypergeometrically distributed variable, the value of N should be at least ten 
times as large as the value of n. To illustrate the binomial approximation, let us consider the 
following values for a hypergeometrically distributed variable: N = 40, N₁ = 10, N₂ = 30, 
n = 3, x = 1. When the hypergeometric probability for x = 1 is computed below, we obtain 
the value p = .4403. 

\[ P(x = 1) = \frac{\binom{10}{1}\binom{30}{2}}{\binom{40}{3}} = .4403 \]

Employing Table A6, we can determine the binomial probability for x = 1 when n = 3, 
π₁ = .25 (since N₁/N = 10/40 = .25), and π₂ = .75 (since N₂/N = 30/40 = .75). The latter 
value, which is also computed below with Equation 9.3, is .4219 (which is very close to the exact 
hypergeometric probability of .4403 computed above). 


\[ P(x = 1) = \binom{n}{x}(\pi_1)^x(\pi_2)^{n-x} = \binom{3}{1}(.25)^1(.75)^2 = .4219 \]
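
The comparison between the exact hypergeometric probability and its binomial approximation can be reproduced with the short sketch below. The helper names `hypergeom_pmf` and `binom_pmf` are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def hypergeom_pmf(x, N, N1, n):
    """P(x Category 1 cases) when n cases are drawn without replacement."""
    return comb(N1, x) * comb(N - N1, n - x) / comb(N, n)

def binom_pmf(x, n, p1):
    """P(x Category 1 cases) in n independent trials (Equation 9.3)."""
    return comb(n, x) * p1**x * (1 - p1)**(n - x)

exact = hypergeom_pmf(1, N=40, N1=10, n=3)     # sampling without replacement
approx = binom_pmf(1, n=3, p1=10 / 40)         # sampling with replacement
print(round(exact, 4), round(approx, 4))
```

Increasing N while holding N₁/N and n fixed should bring the two values closer together, which is one way to see why the approximation is recommended only when N is large relative to n.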


4. The Poisson distribution With the exception of the binomial distribution, the Poisson dis- 
tribution is probably the most commonly employed discrete probability distribution. The latter 
distribution is named after the French mathematician Simeon Denis Poisson, who described it 
in the 1830s (although Zar (1999) notes that it was described previously by another 
mathematician, Abraham de Moivre, in 1718). The Poisson distribution (which Pagano and 
Gauvreau (1993) note is sometimes referred to as the distribution of rare events) is most 
commonly employed in evaluating a distribution of random events that have a low probability 
of occurring. The Poisson distribution is employed most frequently in situations where there is 
an interest in the number of times a particular event occurs within a specified period of time or 
within a specific physical environment. Feller (1968) and Guenther (1968) cite the following as 
examples of random events whose behavior is consistent with the Poisson distribution: a) The 
number of automobile accidents per month in a large city; b) The number of meteorites that land 
in areas of fixed size in a desert; c) The number of typographical errors per page in a manuscript; 
d) The number of telephone calls a person receives in a 24-hour period; e) The number of 
defective products manufactured daily on an assembly line; f) The number of atoms per second 
that disintegrate from a radioactive substance; and g) The number of bombs that hit specified 
blocks of equal area in London during World War II. 

The model for the Poisson distribution assumes that within a given time period or within 
a given area, the likelihood of a random event occurring (with the occurrence of events being 
independent of one another) is very low. Consequently if we have an infinite number of time 
periods or an infinite block of areas, the likelihood of more than one event occurring within any 
time period or area is very small (although, theoretically, there is no limit on the number of 
events that can occur within a specific time period/area). 

A Poisson distribution has what is referred to as a parameter (sometimes referred to as 
a rate parameter), which is represented by the notation λ (the lower case Greek letter 
lambda). The parameter λ is the average number of events that occur over a given period of time 
or within a specified area of space. λ also happens to be the variance of a Poisson distributed 
variable. Thus, Equations 9.18 and 9.19 define the mean/expected value (μ) and variance (σ²) of a 
Poisson distribution. 


\[ \mu = \lambda \]    (Equation 9.18)

\[ \sigma^2 = \lambda \]    (Equation 9.19)


It can be seen from inspection of Equations 9.18 and 9.19 that in a Poisson distribution 
μ = σ², and because of the latter it is often assumed that any distribution where μ = σ² is likely 
to be Poisson. Equation 9.20 is the general equation for computing a probability for the Poisson 
distribution. The latter equation computes the probability of x events occurring in a given period 
of time or within a specified area of space. Since x is a measure of a discrete random variable, 
any value obtained for x will have to be an integer number. 


\[ P(x) = \frac{e^{-\mu}\mu^x}{x!} \]    (Equation 9.20)





Example 9.17 will be employed to illustrate the computation of a Poisson probability with 
Equation 9.20. 


Example 9.17 The traffic bureau of a Midwestern city claims that on the average two accidents 
occur per day, and that the frequency distribution of accidents conforms to a Poisson distri- 
bution. If the latter is true, what are the probabilities for the following numbers of accidents 
occurring per day: 0, 1, 2, 3, 4, 5, 6, 7 or more? 


Given that the average equals two, we can say that λ = μ = σ² = 2. Substituting the 
values μ = 2 and e = 2.71828 in Equation 9.20, the probabilities are computed below for the 
values of x noted in Example 9.17. 

\[ P(x = 0) = \frac{e^{-\mu}\mu^x}{x!} = \frac{(2.71828)^{-2}(2)^0}{0!} = .1353 \]


This result tells us that the likelihood of zero traffic accidents occurring is .1353. 

\[ P(x = 1) = \frac{e^{-\mu}\mu^x}{x!} = \frac{(2.71828)^{-2}(2)^1}{1!} = .2707 \]


This result tells us that the likelihood of one traffic accident occurring is .2707. 

\[ P(x = 2) = \frac{e^{-\mu}\mu^x}{x!} = \frac{(2.71828)^{-2}(2)^2}{2!} = .2707 \]


This result tells us that the likelihood of two traffic accidents occurring is .2707. 

\[ P(x = 3) = \frac{e^{-\mu}\mu^x}{x!} = \frac{(2.71828)^{-2}(2)^3}{3!} = .1804 \]


This result tells us that the likelihood of three traffic accidents occurring is .1804. 

\[ P(x = 4) = \frac{e^{-\mu}\mu^x}{x!} = \frac{(2.71828)^{-2}(2)^4}{4!} = .0902 \]


This result tells us that the likelihood of four traffic accidents occurring is .0902. 


\[ P(x = 5) = \frac{e^{-\mu}\mu^x}{x!} = \frac{(2.71828)^{-2}(2)^5}{5!} = .0361 \]


This result tells us that the likelihood of five traffic accidents occurring is .0361. 


\[ P(x = 6) = \frac{e^{-\mu}\mu^x}{x!} = \frac{(2.71828)^{-2}(2)^6}{6!} = .0120 \]


This result tells us that the likelihood of six traffic accidents occurring is .0120. 

The likelihood of seven or more accidents occurring is the sum of the probabilities for zero 
through six accidents (which is .9954) subtracted from 1. Thus, the likelihood of seven or more 
accidents occurring is 1 − .9954 = .0046. 
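
The probabilities for Example 9.17 can be checked with the following sketch of Equation 9.20. The function name `poisson_pmf` is illustrative. Because the code carries full precision rather than four-decimal values, the probability of seven or more accidents comes out near .0045 rather than the .0046 obtained above from the rounded probabilities.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Equation 9.20: probability of exactly x events when the mean is mu."""
    return exp(-mu) * mu**x / factorial(x)

probs = [poisson_pmf(x, 2) for x in range(7)]   # x = 0, 1, ..., 6 accidents
tail = 1 - sum(probs)                           # seven or more accidents

for x, p in enumerate(probs):
    print(x, round(p, 4))
print("7 or more", round(tail, 4))
```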

Under certain conditions the Poisson distribution can be employed to approximate the 
binomial distribution. When the latter is done it can facilitate the often tedious computations that 
are involved in determining binomial probabilities. The optimal conditions for approximating 
the binomial distribution with the Poisson distribution are when n is large and the value of π₁ is 
very small. Under the latter conditions the value of π₂ will be very close to 1, since 
1 − π₁ = π₂. In such a case if we set the value of π₂ equal to 1, Equation 9.2 (the equation for 
computing the standard deviation of a binomially distributed variable) reduces to 
σ = √(nπ₁π₂) = √(nπ₁). If the latter is true, σ² = nπ₁. Since the expected value of a binomially 
distributed variable is μ = nπ₁, under these conditions μ and σ² are identical, which is the case 
with a variable that conforms to a Poisson distribution. Thus, both μ and σ² may be represented 
by the parameter λ. 

Consequently, we can say that when n is large and π₁ is very small, the relationship noted 
below is true (the notation ≈ means approximately). The left side of the relationship is Equation 
9.3 (the equation for computing the likelihood of a specific value of x for a binomially 
distributed variable), and the right side of the relationship is Equation 9.20 (the equation for 
computing the likelihood of a specific value of x for a variable that has a Poisson distribution). 


\[ \binom{n}{x}(\pi_1)^x(\pi_2)^{n-x} \approx \frac{e^{-\mu}\mu^x}{x!} \]


To illustrate the Poisson approximation of the binomial distribution, consider Example 9.18. 

Example 9.18 Assume that there is a .03 probability of a specific microorganism growing in 
a culture. What is the likelihood that in a batch of 200 cultures five of the cultures will contain 
the microorganism? 

Employing Equation 9.3, we compute the binomial probability for n = 200, x = 5, 
π₁ = .03, and π₂ = .97. The obtained value is p = .1622, which is the likelihood that in a batch 
of 200 cultures five will contain the microorganism. 


\[ P(x = 5) = \binom{n}{x}(\pi_1)^x(\pi_2)^{n-x} = \binom{200}{5}(.03)^5(.97)^{195} = .1622 \]


The Poisson estimate of the binomial probability is determined as follows. First, employing 
Equation 9.1 we compute the expected value μ, which is μ = 6. We then employ the latter 
value along with x = 5 in Equation 9.20, and compute the probability .1606, which is the 
likelihood that in a batch of 200 cultures five will contain the microorganism. Note that the value 
p = .1606 is very close to the exact binomial probability of p = .1622. 


\[ \mu = \lambda = n\pi_1 = (200)(.03) = 6 \]

\[ P(x = 5) = \frac{e^{-\mu}\mu^x}{x!} = \frac{(2.71828)^{-6}(6)^5}{5!} = .1606 \]
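
The two computations for Example 9.18 can be placed side by side in a brief sketch. The variable names are illustrative, and the exact binomial value is obtained here with `math.comb` rather than from a table.

```python
from math import comb, exp, factorial

n, p1, x = 200, 0.03, 5                              # Example 9.18

binomial = comb(n, x) * p1**x * (1 - p1)**(n - x)    # Equation 9.3
mu = n * p1                                          # Equation 9.1
poisson = exp(-mu) * mu**x / factorial(x)            # Equation 9.20

print(round(binomial, 4), round(poisson, 4))
```

The difference between the two results is under .002, which is consistent with the guidelines on n and π₁ cited for the approximation.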


Sources are not in agreement with respect to what values of n and π₁ are appropriate for 
employing the Poisson approximation of the binomial distribution. Hogg and Tanis (1997) 
recommend the approximation if n > 20 and π₁ < .05, or if n > 100 and π₁ < .10. Daniel and 
Terrell (1995) concur with the latter, and state the approximation is usually very good when 
n > 100 and nπ₁ < 10. Rosner (1995) states the more conservative criterion that n > 100 and 
π₁ < .01. 


Evaluating goodness-of-fit for a Poisson distribution There may be occasions when a 
researcher wants to evaluate the hypothesis that a set of data is derived from a Poisson distribution. 
Example 9.19, which is an extension of Example 9.17, will be employed to demonstrate how the 
latter hypothesis can be evaluated with the chi-square goodness-of-fit test. 


Example 9.19 The traffic bureau of a Midwestern city determines that the average number of 
accidents per day is 2. During a 300-day period the following number of accidents are recorded 
per day: a) 30 days 0 accidents occur; b) 90 days there is 1 accident; c) 89 days there are 2 
accidents; d) 53 days there are 3 accidents; e) 30 days there are 4 accidents; f) 6 days there are 
5 accidents; g) 2 days there are 6 accidents; and h) 7 or more accidents do not occur on any day 
during the 300 day period. Is the distribution of the data consistent with a Poisson distribution? 


The null and alternative hypotheses that are evaluated with the chi-square goodness-of-fit 
test in reference to Example 9.19 can either be stated in the form as presented in Section III of 
the latter test (i.e., H₀: oᵢ = εᵢ for all cells; H₁: oᵢ ≠ εᵢ for at least one cell), or as follows. 


Null hypothesis H₀: The sample is derived from a population with a Poisson distribution. 


Alternative hypothesis H₁: The sample is not derived from a population with a Poisson 
distribution. This is a nondirectional alternative hypothesis. 


The analysis of Example 9.19 with the chi-square goodness-of-fit test is summarized in 
Table 9.7. The latter table is comprised of k = 7 cells/categories, with each cell corresponding 
to a given number of accidents per day. The second column of the table contains the observed 
frequencies for the specified number of accidents. The expected frequency in Column 3 for each 
cell was obtained by employing Equation 8.1. Specifically the value 300, which represents the 
total number of days, is multiplied by the appropriate Poisson probability for a given number of 
accidents (which was previously computed in Example 9.17). The latter Poisson probabilities 
are as follows: x = 0 (p = .1353); x = 1 (p = .2707); x = 2 (p = .2707); x = 3 (p = .1804); 
x = 4 (p = .0902); x = 5 (p = .0361); and x ≥ 6 (p = .0120 + .0046 = .0166). To illustrate the 
computation of an expected frequency, the value 40.59 is obtained for Row 1 (0 accidents) as 
follows: E₁ = (300)(.1353) = 40.59. 


Table 9.7 Chi-Square Summary Table for Example 9.19 (Poisson Analysis) 


Cell/Number 
of Accidents      Oᵢ         Eᵢ       (Oᵢ − Eᵢ)    (Oᵢ − Eᵢ)²    (Oᵢ − Eᵢ)²/Eᵢ 
0                 30        40.59      −10.59        112.15          2.76 
1                 90        81.21        8.79         77.26           .95 
2                 89        81.21        7.79         60.68           .74 
3                 53        54.12       −1.12          1.25           .02 
4                 30        27.06        2.94          8.64           .32 
5                  6        10.83       −4.83         23.33          2.15 
6 or more          2         4.98       −2.98          8.88          1.78 
Sums          ΣOᵢ = 300  ΣEᵢ = 300  Σ(Oᵢ − Eᵢ) = 0                χ² = 8.72 


Employing Equation 8.2, the value χ² = 8.72 is computed for Example 9.19. Since there 
are k = 7 cells and no parameters are estimated (i.e., w = 0), employing Equation 8.7, the 
degrees of freedom for the analysis are df = 7 − 1 − 0 = 6. Employing Table A4, we determine 
that for df = 6 the tabled critical .05 and .01 values are χ² = 12.59 and χ² = 16.81, respectively. 
Since the computed value χ² = 8.72 is less than both of the aforementioned values, the null 
hypothesis cannot be rejected. Thus, the analysis does not indicate that the data deviate 
significantly from a Poisson distribution. 
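
The chi-square computation for the Poisson analysis of Example 9.19 can be sketched as follows. The variable names are illustrative, and the probabilities are the four-decimal values listed in the text. Because the code does not round each cell's contribution to two decimals before summing, the result may differ from 8.72 in the second decimal place.

```python
# Observed days with 0, 1, 2, 3, 4, 5, and 6-or-more accidents (Example 9.19).
observed = [30, 90, 89, 53, 30, 6, 2]
# Poisson probabilities for a mean of 2, rounded as reported in the text.
probs = [.1353, .2707, .2707, .1804, .0902, .0361, .0166]

total_days = sum(observed)
expected = [total_days * p for p in probs]            # expected frequency per cell
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                                # no parameter estimated from the data

critical_05 = 12.59                                   # tabled critical value for df = 6
print(round(chi_square, 2), df, chi_square < critical_05)
```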

It should be noted that if, instead of stipulating the value μ = 2 to represent the mean 
number of accidents, we computed the mean number of accidents from the sample data, the value 
of df is reduced by 1 (i.e., df = 7 − 1 − 1 = 5). The latter is the case since an additional degree of 
freedom must be subtracted for any parameter that is estimated. In point of fact, in Example 9.19 
the sample data yield an average number of accidents equal to X̄ = 1.96. The latter value is 
obtained as follows: a) In each row of Table 9.7, multiply the value for the number of accidents 
in Column 1 by the observed frequency for that number of accidents in Column 2 (multiply 6 
by 2 in the last row, since on the two days recorded there were six accidents); and b) Sum the 
seven products obtained in a). The latter sum will equal the total number of accidents (which 
comes out to 589), and if that value is divided by the total number of days (300) it yields the 
value X̄ = 1.96. The latter value would represent the best estimate of the population mean. 
Since the latter value is almost equal to μ = 2 employed in the analysis, it would not result in a 
different conclusion. 
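
The estimate of the mean from the sample data can be verified with a short sketch. The variable names are illustrative, and the last row of Table 9.7 is treated as exactly six accidents, as described above.

```python
accidents = [0, 1, 2, 3, 4, 5, 6]     # accidents per day (last row taken as exactly 6)
days = [30, 90, 89, 53, 30, 6, 2]     # observed frequencies from Table 9.7

total_accidents = sum(a * d for a, d in zip(accidents, days))
sample_mean = total_accidents / sum(days)
df_estimated = len(days) - 1 - 1      # one additional df lost for the estimated mean

print(total_accidents, round(sample_mean, 2), df_estimated)
```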

It was noted earlier that when n is large and the value of π₁ is very small, the Poisson 
distribution provides a good approximation for the binomial distribution. In point of fact, 
the latter conditions apply to Example 9.19. If Example 9.19 is conceptualized within the 
framework of a binomial model, n = 300 and π₁ = .0067. Within the binomial model there are 
300 trials and, on each trial, there are two possible outcomes, accident versus no accident. The 
value of π₁ can be computed through use of Equation 9.1 as follows: a) Transpose the terms in 
the latter equation to solve for π₁; b) Since we know the mean of the distribution is μ = 2 and 
n = 300, we obtain π₁ = μ/n = 2/300 = .0067. 

In point of fact, if the chi-square goodness-of-fit test is employed to evaluate the data for 
Example 9.19 for goodness-of-fit for a binomial distribution, the null hypothesis (which states 
that the data are derived from a binomial distribution) is supported. Table 9.8 summarizes the 
analysis. The following binomial probabilities are multiplied by 300 to get the expected 
frequency for each row of the table: x = 0 (p=.1331); x = 1 (p=.2693); x = 2 (p=.2722); 
x = 3 (p=.1821); x = 4 (p=.0904); x = 5 (p=.0369); and x ≥ 6 (p=.0161). The latter 
probabilities were obtained through use of Equation 9.3 using the values n = 300, π₁ = .0067, 
π₂ = .9933, and the designated value of x. Note that the binomial probabilities are almost 
identical to the Poisson probabilities for the corresponding value of x. 


Table 9.8 Chi-Square Summary Table for Example 9.19 (Binomial Analysis) 


Cell/Number 
of Accidents      Oᵢ         Eᵢ       (Oᵢ − Eᵢ)    (Oᵢ − Eᵢ)²    (Oᵢ − Eᵢ)²/Eᵢ 
0                 30        39.93       −9.93         98.60          2.47 
1                 90        80.79        9.21         84.82          1.05 
2                 89        81.66        7.34         53.88           .66 
3                 53        54.63       −1.63          2.66           .05 
4                 30        27.12        2.88          8.29           .31 
5                  6        11.07       −5.07         25.70          2.32 
6 or more          2         4.83       −2.83          8.01          1.66 
Sums          ΣOᵢ = 300  ΣEᵢ = 300  Σ(Oᵢ − Eᵢ) = 0                χ² = 8.52 


Employing Equation 8.2, the value χ² = 8.52 is computed (which is almost identical 
to χ² = 8.72 computed earlier for goodness-of-fit for a Poisson distribution). Since df = 7 − 1 
= 6, employing Table A4, we determine that the tabled critical .05 and .01 values are 
χ² = 12.59 and χ² = 16.81, respectively. Since the computed value χ² = 8.52 is less than both of the 
aforementioned values, the null hypothesis cannot be rejected. Thus, the analysis does not 
indicate that the data deviate significantly from a binomial distribution. The above example 
illustrates the fact that there are circumstances when a Poisson distribution and binomial 
distribution will be so similar to one another, that a goodness-of-fit test will not be able to clearly 
discriminate between the two distributions. 
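
The binomial analysis can be sketched in the same manner, here computing the binomial probabilities directly from Equation 9.3 with n = 300 and π₁ = .0067. The names are illustrative. Since the probabilities are carried at full precision, individual values may differ in the third or fourth decimal place from those listed in the text, and the resulting chi-square value is close to, though not necessarily identical with, 8.52.

```python
from math import comb

observed = [30, 90, 89, 53, 30, 6, 2]     # Example 9.19 frequencies
n, p1 = 300, 0.0067                       # binomial model described in the text

def binom_pmf(x):
    """Equation 9.3 for the fixed values of n and p1 above."""
    return comb(n, x) * p1**x * (1 - p1)**(n - x)

probs = [binom_pmf(x) for x in range(6)]  # x = 0, 1, ..., 5
probs.append(1 - sum(probs))              # x >= 6 pooled into a single cell

expected = [sum(observed) * p for p in probs]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print([round(p, 4) for p in probs], round(chi_square, 2))
```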

Zar (1999) describes additional analytical procedures that can be employed for the Poisson 
distribution, including computation of a confidence interval and a test of significance comparing 
two Poisson counts. 


5. The matching distribution The matching distribution is a discrete probability distribution 
that can be employed to evaluate certain experimental situations. In order to describe the model 
for the matching distribution, let us assume that we have two identical decks of cards with n₁ 
cards in Deck 1 and n₂ cards in Deck 2, with n₁ = n₂. If we conduct an experiment that is 
comprised of n trials, and on each trial we randomly select one card from Deck 1 and one card 
from Deck 2, the probability of obtaining x matches between cards in the two decks is defined 
by Equation 9.21. Note that the terms enclosed in the brackets of Equation 9.21 constitute a 
series (which is a sequence of numbers that are added to and/or subtracted from one another). 

\[ P(x) = \frac{1}{x!}\left[\frac{1}{0!} - \frac{1}{1!} + \frac{1}{2!} - \frac{1}{3!} + \cdots \pm \frac{1}{(n - x)!}\right] \]    (Equation 9.21)


Example 9.20 will be used to illustrate how Equation 9.21 is employed to compute 
probabilities for the matching distribution. 


Example 9.20 A subject claims that he has extrasensory ability. To test the subject, the 
following five playing cards are randomly arranged face down on a table: Ace of spades; King 
of spades; Queen of spades; Jack of spades; Ten of spades. The subject is given a set of five 
cards with identical face values as those on the table, and told to place each of the cards he is 
holding on top of the corresponding card on the table. Is the subject's performance statistically 
significant at the .05 level if he matches two of the five cards correctly? 


In this experiment there are n = 5 trials, and we want to determine the probability of 
obtaining two or more matches. The null hypothesis to be evaluated is that the subject will 
perform within chance expectation (or to say it another way, the performance of the subject will 
not suggest extrasensory ability). The alternative hypothesis that will be evaluated is the 
directional/one-tailed alternative hypothesis that states the subject will perform at an above 
chance level (or to say it another way, the performance of the subject suggests extrasensory 
ability). 

Equation 9.21 is employed below to compute the probability of obtaining x = 0, x = 1, 
x = 2, x = 3, and x = 5 matches. Note that a subject cannot obtain 4 matches without obtaining 5 
matches, since if there are (n — 1) matches there must be n matches. The probabilities for all 
values of x sum to 1 (there is a slight discrepancy due to rounding off error). 


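\[ P(x = 0) = \frac{1}{0!}\left[\frac{1}{0!} - \frac{1}{1!} + \frac{1}{2!} - \frac{1}{3!} + \frac{1}{4!} - \frac{1}{5!}\right] = .3664 \]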


This result tells us that the likelihood of a subject obtaining 0 matches is .3664. 


\[ P(x = 1) = \frac{1}{1!}\left[\frac{1}{0!} - \frac{1}{1!} + \frac{1}{2!} - \frac{1}{3!} + \frac{1}{4!}\right] = .3747 \]

This result tells us that the likelihood of a subject obtaining 1 match is .3747. 


\[ P(x = 2) = \frac{1}{2!}\left[\frac{1}{0!} - \frac{1}{1!} + \frac{1}{2!} - \frac{1}{3!}\right] = .1665 \]











This result tells us that the likelihood of a subject obtaining 2 matches is .1665. 


\[ P(x = 3) = \frac{1}{3!}\left[\frac{1}{0!} - \frac{1}{1!} + \frac{1}{2!}\right] = .0835 \]


This result tells us that the likelihood of a subject obtaining 3 matches is .0835. 

\[ P(x = 4) = \frac{1}{4!}\left[\frac{1}{0!} - \frac{1}{1!}\right] = 0 \]

This result tells us that the likelihood of a subject obtaining 4 matches is 0, since there 
cannot be 4 matches without 5 matches. 


\[ P(x = 5) = \frac{1}{5!}\left[\frac{1}{0!}\right] = .0083 \]


This result tells us that the likelihood of a subject obtaining 5 matches is .0083. 

To evaluate the subject’s score of x = 2 matches, we have to determine the probability of 
obtaining two or more matches (i.e., P(x ≥ 2)) when n = 5. The latter value is computed by adding 
the probabilities for 2, 3, and 5 matches computed above, since all of those values are equal to 
or greater than a score of 2 matches. Thus, .1665 + .0835 + .0083 = .2583. Since the obtained 
value p = .2583 is greater than the value α = .05, we retain the null hypothesis. Thus, we cannot 
conclude that the subject exhibits evidence of extrasensory ability. 
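
The matching probabilities for Example 9.20 can be reproduced with the sketch below. The function name `matching_pmf` is illustrative. The code evaluates the series in Equation 9.21 at full precision, so individual values may differ slightly in the third or fourth decimal place from the hand computations above (for instance, about .3667 rather than .3664 for zero matches).

```python
from math import factorial

def matching_pmf(x, n):
    """Equation 9.21: probability of exactly x matches among n paired cards."""
    # Alternating series 1/0! - 1/1! + 1/2! - ... up to the term 1/(n - x)!
    series = sum((-1) ** k / factorial(k) for k in range(n - x + 1))
    return series / factorial(x)

probs = [matching_pmf(x, 5) for x in range(6)]   # x = 0, 1, ..., 5 matches
p_two_or_more = sum(probs[2:])                   # P(x >= 2)

print([round(p, 4) for p in probs], round(p_two_or_more, 4))
```

Note that the value for x = 4 is exactly zero, since the series for n − x = 1 reduces to 1 − 1.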

It turns out that the probabilities computed for the matching distribution are quite close to 
the probabilities that will result if the problem under discussion is reconceptualized within the 
framework of the sampling with replacement model. If the latter model is used, the appropriate 
distribution to employ to compute probabilities for the number of matches is the binomial 
distribution. To illustrate the use of the sampling with replacement model, let us assume that 
after the five test cards are placed face down on the table, the subject is told to randomly select 
one card from his own identical deck of five cards, and see if it matches the first card that is face 
down on the table. The subject then puts the card he selected from his own five card deck back 
into his deck, and randomly selects a second card and sees if it matches the second card that is 
face down on the table. The subject then puts the card he selected from his own five card deck 
back into his deck and continues the same process until he has attempted to randomly match a 
card from his complete five card deck with each of the five cards that are face down on the table. 
As described, this variant of the experiment involves n = 5 trials, and on any given trial there is 
a one in five chance of the subject being correct. Thus, we are dealing with a binomially 
distributed variable, where n = 5, π₁ = 1/5 = .2, and π₂ = 4/5 = .8. To determine the 
probability of obtaining 0, 1, 2, 3, 4, or 5 matches, we employ the section in Table A6 for n = 
5 and π₁ = .2. (Note that when the sampling with replacement model is employed, it is possible 
to obtain 4 matches.) The binomial probabilities obtained from Table A6 are as follows: x = 0 
(p=.3277); x = 1 (p=.4096); x = 2 (p=.2048); x = 3 (p=.0512); x = 4 (p=.0064); x = 5 
(p =.0003). Note that the latter values are reasonably close to the probabilities for the matching 
distribution that were computed previously for the original problem. A more detailed discussion 
of the matching distribution can be found in Feller (1968). 
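
The binomial probabilities for the sampling with replacement variant can be computed directly instead of being read from Table A6. The following is a minimal sketch with illustrative variable names.

```python
from math import comb

n, p1 = 5, 0.2    # five trials, one-in-five chance of a match on each trial

binomial = [comb(n, x) * p1**x * (1 - p1)**(n - x) for x in range(n + 1)]
p_two_or_more = sum(binomial[2:])    # P(2 or more matches) under this model

print([round(p, 4) for p in binomial])
print(round(p_two_or_more, 4))
```

Under this model the probability of two or more matches is approximately .2627, which is close to the value .2583 obtained with the matching distribution.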

Endnotes 


1. a) The binomial distribution is based on a process developed by the Swiss mathematician 
James Bernoulli (1654-1705). Each of the trials in an experiment involving a binomially 
distributed variable is often referred to as a Bernoulli trial. The conditions for Bernoulli 
trials are met when, in a set of repeated independent trials, on each trial there are only two 
possible outcomes, and the probability for each of the outcomes remains unchanged on 
every trial; b) The binomial model assumes sampling with replacement. To understand 
the latter term, imagine an urn that contains a large number of red balls and white balls. In 
each of n trials one ball is randomly selected from the urn. In the sampling with replace- 
ment model, after a ball is selected it is put back in the urn, thus insuring that the probability 
of drawing a red or white ball will remain the same on every trial. On the other hand, in the 
sampling without replacement model, the ball that is selected is not put back in the urn 
after each trial. Because of the latter, in the sampling without replacement model the prob- 
ability of selecting a red ball versus a white ball will change from trial to trial, and on any 
trial the value of the probabilities will be a function of the number of balls of each color that 
remain in the urn. The binomial model assumes sampling with replacement, since on each 
trial the likelihood that an observation will fall in Category 1 will always equal π₁, and the 
likelihood that an observation will fall in Category 2 will always equal π₂. The classic 
situation for which the binomial model is employed is the process of flipping a fair coin. 
In the coin flipping situation, on each trial the likelihood of obtaining Heads is π₁ = .5, 
and the likelihood of obtaining Tails is π₂ = .5. The process of flipping a coin can be 
viewed within the framework of sampling with replacement, since it can be conceptualized 
as selecting from a large urn that is filled with the same number of Heads and Tails on 
every trial. In other words, it's as if after each trial the alternative that was selected on that 
trial is thrown back into the urn, so that the likelihood of obtaining Heads or Tails will 
remain unchanged from trial to trial. In Section IX (the Addendum) the hypergeometric 
distribution (another discrete probability distribution) is described, which is based upon 
the sampling without replacement model; c) The binomial distribution is actually a special 
case of the multinomial distribution. In the latter distribution, each of n independent 
observations can be classified in one of k mutually exclusive categories, where k can be any 
integer value equal to or greater than two. The multinomial distribution is described in 
detail in Section IX (The Addendum). 


2. The reader should take note of the fact that most sources employ the notations p and q to 
represent the population proportions π₁ and π₂. Because of this, the equations for the mean 
and standard deviation of a binomially distributed variable are written as follows: μ = np 
and σ = √(npq). The use of the notations π₁ and π₂ in this book for the population 
proportions is predicated on the fact that throughout the book Greek letters are employed 
to represent population parameters. 


3. Using the format employed for stating the null hypothesis and the nondirectional alternative 
hypothesis for the chi-square goodness-of-fit test, H₀ and H₁ can also be stated as follows 
for the binomial sign test for a single sample: H₀: oᵢ = εᵢ for both cells; H₁: oᵢ ≠ εᵢ for 
both cells. Thus, the null hypothesis states that in the underlying population the sample 
represents, for both cells/categories, the observed frequency of a cell is equal to the 
expected frequency of the cell. The alternative hypothesis states that in the underlying 
population the sample represents, for both cells/categories, the observed frequency of a cell 
is not equal to the expected frequency of the cell. 


4. The question can also be stated as follows: If n = 10 and π₁ = π₂ = .5, what is the 
probability of 2 or less observations in one of the two categories? When π₁ = π₂ = .5, the 
probability of two or less observations in Category 2 will equal the probability of eight or 
more observations in Category 1 (or vice versa). When, however, π₁ ≠ π₂, the probability 
of 2 or less observations in Category 2 will not equal the probability of 8 or more 
observations in Category 1. 


5. The number of combinations of n things taken x at a time represents the number of 
different ways that n objects can be arranged x at a time without regard to order. For 
instance, if one wants to determine the number of ways that 3 objects (which we will 
designate A, B, and C) can be arranged 2 at a time without regard to order, the following 
3 outcomes are possible: 1) An A and a B (which can result from either the sequence AB 
or BA); 2) An A and a C (which can result from either the sequence AC or CA); or 3) A B 
and a C (which can result from the sequence BC or CB). Thus, there are 3 combinations 
of ABC taken 2 at a time. This is confirmed below through use of Equation 9.4. 


\[ \binom{3}{2} = \frac{3!}{2!\,(3 - 2)!} = \frac{3!}{2!\,1!} = 3 \]


To extend the concept of combinations to a coin tossing situation, let us assume that 
one wants to determine the exact number of ways 2 Heads can be obtained if a coin is 
tossed 3 times. If a fair coin is tossed 3 times, any one of the following 8 sequences of 
Heads (H) and Tails (T) is equally likely to occur: HHH, HHT, THH, HTH, THT, HTT, 
TTH, TTT. Of the 8 possible sequences, only the following 3 sequences involve 2 Heads 
and 1 Tails: HHT, THH, HTH. The latter 3 arrangements represent the combination of 3 
things taken 2 at a time. This can be confirmed by the computation below. 

\[ \binom{3}{2} = \frac{3!}{2!\,1!} = 3 \]

When the order of the arrangements is of interest, one can compute the number of 
permutations of n things taken x at a time. The latter is represented by the notation Pₓⁿ, 
where Pₓⁿ = n!/(n − x)!. Thus, if one is interested in the order when the events A, B, 
and C are taken 2 at a time, the following number of permutations are computed: 
P₂³ = 3!/(3 − 2)! = 3!/1! = 6. The 6 possible arrangements taking order into account are AB, 
BA, AC, CA, BC, and CB. 

Although based on the definition of a permutation that has been presented above, one 
might conclude that the three combinations HHT, THH, HTH take order into account, and 
thus represent permutations, they can be conceptualized as combinations if one views the 
binomial model as follows: Within each of the three combinations HHT, THH, HTH, the 
2 Heads are distinguishable from one another, insofar as each of the Heads can be 
assigned a subscript to designate it as distinct from the other Heads. To illustrate, within 
the combination HHT, H₁ and H₂ can be employed to distinguish the two Heads from one 
another. If we imagine that the two Heads are randomly selected from an urn, and one 
Heads is assigned the label H₁, and the other Heads is assigned the label H₂, the following 
two permutations are possible: H₁H₂T, H₂H₁T. Thus, the arrangement HHT is a combination 
that summarizes the two distinct permutations H₁H₂T, H₂H₁T. Based on what has 
been said, it follows that the following 6 permutations comprise the 3 combinations HHT, 
THH, and HTH: H₁H₂T, H₂H₁T, TH₁H₂, TH₂H₁, H₁TH₂, H₂TH₁. In point of fact, many 
sources (e.g., Marascuilo and McSweeney (1977, pp. 12-13)) describe the value computed 
for the binomial coefficient as a permutation, since it can be viewed as representing a value 
based on two sets of different but identical objects. 


6. The application of Equation 9.3 to every possible value of x (i.e., in the case of Examples 
9.1 and 9.2, the integer values 0 through 10) will yield a probability for every value of x. 
The sum of these probabilities will always equal 1. The algebraic expression which summarizes 
the summation of the probability values for all possible values of x is referred to 
as the binomial expansion, summarized by Equation 9.5, which is equivalent to the 
general equation $(\pi_1 + \pi_2)^n$ (or $(p + q)^n$ when p and q are employed in lieu of $\pi_1$ and 
$\pi_2$). Thus: 

$$\sum_{x=0}^{n} \binom{n}{x} \pi_1^{x} \pi_2^{n-x} = (\pi_1 + \pi_2)^n$$

To illustrate, if n = 3, $\pi_1 = .5$, and $\pi_2 = .5$, the binomial expansion is as follows. 

$$(\pi_1 + \pi_2)^3 = \pi_1^3 + 3\pi_1^2\pi_2 + 3\pi_1\pi_2^2 + \pi_2^3 = (.5)^3 + 3(.5)^2(.5) + 3(.5)(.5)^2 + (.5)^3$$

Each of the four terms that comprise the above noted binomial expansion can be 
computed with Equation 9.3 as noted below. The computed probabilities .125, .375, .375, 
and .125 are respectively the likelihood of obtaining 3, 2, 1, and 0 outcomes in the category 
associated with $\pi_1$ if n = 3 and $\pi_1 = .5$. 

Term 1 (P(3/3)) = $\pi_1^3 = \binom{3}{3}(.5)^3(.5)^0 = .125$ 
Term 2 (P(2/3)) = $3\pi_1^2\pi_2 = \binom{3}{2}(.5)^2(.5)^1 = .375$ 
Term 3 (P(1/3)) = $3\pi_1\pi_2^2 = \binom{3}{1}(.5)^1(.5)^2 = .375$ 
Term 4 (P(0/3)) = $\pi_2^3 = \binom{3}{0}(.5)^0(.5)^3 = .125$ 
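
The four terms of the expansion can be reproduced numerically. The sketch below assumes that Equation 9.3 is the usual binomial probability formula (the binomial coefficient multiplied by the two powers); the function name `binomial_pmf` is illustrative and is not taken from the Handbook.

```python
from math import comb


def binomial_pmf(x, n, pi_1):
    """Probability of exactly x observations in Category 1 out of n trials."""
    pi_2 = 1 - pi_1
    return comb(n, x) * pi_1**x * pi_2**(n - x)


n, pi_1 = 3, 0.5

# Terms 1 through 4 correspond to x = 3, 2, 1, 0.
terms = [binomial_pmf(x, n, pi_1) for x in range(n, -1, -1)]
total = sum(terms)

print(terms)   # the individual terms of the binomial expansion
print(total)   # the terms sum to 1
```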


7. If $\pi_1 = .5$ and one wants to determine the likelihood of x being equal to or less than a specific 
value, one can employ the cumulative probability listed for the value (n - x). Thus, if 
x = 2, the cumulative probability for x = 8 (which is .0547) is employed since 
n - x = 10 - 2 = 8. The value .0547 indicates the likelihood of obtaining 2 or fewer 
observations in a cell. This procedure can only be used when $\pi_1 = .5$, since, when the 
latter is true, the binomial distribution is symmetrical. 
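
The symmetry argument can be verified directly rather than by table lookup. The following is a minimal sketch that assumes the cumulative table lists the probability of obtaining x or more observations; the helper name `pmf` is hypothetical.

```python
from math import comb


def pmf(x, n, pi_1):
    """Binomial probability of exactly x observations in n trials."""
    return comb(n, x) * pi_1**x * (1 - pi_1)**(n - x)


n, pi_1 = 10, 0.5

# Probability of 2 or fewer observations (the quantity of interest).
p_two_or_fewer = sum(pmf(x, n, pi_1) for x in range(0, 3))

# Probability of 8 or more observations (the value read for x = n - 2 = 8).
p_eight_or_more = sum(pmf(x, n, pi_1) for x in range(8, n + 1))

print(round(p_two_or_fewer, 4), round(p_eight_or_more, 4))
```

The two values agree only because the distribution is symmetrical when $\pi_1 = .5$; with any other value of $\pi_1$ the two tails differ.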


8. If in using Tables A6 and A7 the value of $\pi_2$ is employed to represent $\pi$ in place of $\pi_1$, and 
the number of observations in Category 2 is employed to represent the value of x instead 
of the number of observations in Category 1, then the following are true: a) In Table A6 
(if all values of $\pi$ within the range from 0 to 1 are listed) the probability associated with the 
cell that is the intersection of the values $\pi = \pi_2$ and x (where x represents the number of 
observations in Category 2) will be equivalent to the probability associated with the cell 
that is the intersection of $\pi = \pi_1$ and x (where x represents the number of observations in 
Category 1); and b) In Table A7 (if all values of $\pi$ within the range from 0 to 1 are listed) 
for $\pi = \pi_2$, the probability of obtaining x or fewer observations (where x represents the 
number of observations in Category 2) will be equivalent to, for $\pi = \pi_1$, the probability of 
obtaining x or more observations (where x represents the number of observations in 
Category 1). Thus, if $\pi_1 = .7$, $\pi_2 = .3$, and n = 10 and there are 9 observations in Category 
1 and 1 observation in Category 2, the following are true: a) The probability in Table A6 
for $\pi = \pi_2 = .3$ and x = 1 will be equivalent to the probability for $\pi = \pi_1 = .7$ and 
x = 9; and b) In Table A7 if $\pi = \pi_2 = .3$, the probability of obtaining 1 or fewer 
observations will be equivalent to the probability of obtaining 9 or more observations if 
$\pi = \pi_1 = .7$. 

A modified protocol for employing Table A7 when a more extreme value is defined 
as any value that is larger than the smaller of the two observed frequencies, or smaller than 
the larger of the two observed frequencies is described in Section VIII in reference to 
Example 9.9. 
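
The two equivalences stated in this endnote for $\pi_1 = .7$, $\pi_2 = .3$, and n = 10 can be confirmed by computing the binomial probabilities directly instead of reading them from Tables A6 and A7. This is only a sketch; the helper names are hypothetical.

```python
from math import comb, isclose


def pmf(x, n, pi):
    """Binomial probability of exactly x observations in n trials."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)


def prob_at_most(x, n, pi):
    """Probability of x or fewer observations."""
    return sum(pmf(k, n, pi) for k in range(0, x + 1))


def prob_at_least(x, n, pi):
    """Probability of x or more observations."""
    return sum(pmf(k, n, pi) for k in range(x, n + 1))


n = 10

# a) Individual cells: 1 observation in Category 2 versus 9 in Category 1.
cell_category_2 = pmf(1, n, 0.3)
cell_category_1 = pmf(9, n, 0.7)

# b) Cumulative values: 1 or fewer in Category 2 versus 9 or more in Category 1.
tail_category_2 = prob_at_most(1, n, 0.3)
tail_category_1 = prob_at_least(9, n, 0.7)

print(isclose(cell_category_2, cell_category_1))
print(isclose(tail_category_2, tail_category_1))
```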


9. It will also answer at the same time whether $p_2 = 2/10 = .2$, the observed proportion of 
cases for Category 2, deviates significantly from $\pi_2 = .5$. 


10. Since like the normal distribution the binomial distribution is a two-tailed distribution, the 
same basic protocol is employed in interpreting nondirectional (i.e., two-tailed) and 
directional (one-tailed) probabilities. Thus, in interpreting binomial probabilities one can 
conceptualize a distribution that is similar in shape to the normal distribution, and substitute 
the appropriate binomial probabilities in the distribution. 


11. In Example 9.4 many researchers might prefer to employ the directional alternative 
hypothesis $H_1$: $\pi_1 < .5$, since the senator will only change her vote if the observed 
proportion in the sample is less than .5. In the same respect, in Example 9.5 one might 
employ the directional alternative hypothesis $H_1$: $\pi_1 > .5$, since most people would only 
interpret above chance performance as indicative of extrasensory perception. 


12. Equation 9.7 can also be expressed in the form $z = (X - \mu)/\sigma$. Note that the latter equation 
is identical to Equation I.27, the equation for computing a standard deviation score for a 
normally distributed variable. The difference between Equations 9.7 and I.27 is that 
Equation 9.7 computes a normal approximation for a binomially distributed variable, 
whereas Equation I.27 computes an exact value for a normally distributed variable. 


13. The reader may be interested in knowing that in extrasensory perception (ESP) research, 
evidence of ESP is not necessarily limited to above chance performance. A person who 
consistently scores significantly below chance or only does so under certain conditions 
(such as being tested by an extremely skeptical and/or hostile experimenter) may also be 
used to support the existence of ESP. Thus, in Example 9.6, the subject who obtains a 
score of 80 (which is significantly below the expected value $\mu = 100$) represents someone 
whose poor performance (referred to as psi missing) might be used to suggest the presence 
of extrasensory processes. 


14. When the binomial sign test for a single sample is employed to evaluate a hypothesis 
regarding a population median, it is categorized by some sources as a test of ordinal data 
(rather than as a test of categorical/nominal data), since, when data are categorized with 
respect to the median, it implies ordering of the data within two categories (i.e., above the 
median versus below the median). 


15. a) In the discussion of the Wilcoxon signed-ranks test, it is noted that the latter test is not 
recommended if there is reason to believe that the underlying population distribution is 
asymmetrical. Thus, if there is reason to believe that blood cholesterol levels are not 
distributed symmetrically in the population, the binomial sign test for a single sample 
would be recommended in lieu of the Wilcoxon signed-ranks test; b) Marascuilo and 
McSweeney (1977) note that the asymptotic relative efficiency (discussed in Section VII 
of the Wilcoxon signed-ranks test) of the binomial sign test for a single sample is generally 
lower than that of the Wilcoxon signed-ranks test. If the underlying population 
distribution is normal, the asymptotic relative efficiency of the binomial sign test is .637, 
in contrast to an asymptotic relative efficiency of .955 for the Wilcoxon signed-ranks test 
(with both asymptotic relative efficiencies being in reference to the single-sample t test). 
When the underlying population distribution is not normal, in most cases, the asymptotic 
relative efficiency of the Wilcoxon signed-ranks test will be higher than the analogous 
value for the binomial sign test. 


16. The reader should take note of the fact that the protocol in using Table A7 to interpret a 
$\pi_2$ value that is less than .5 in reference to the value $\pi_2 = .1$ is different than the one 
described in the last paragraph of Section IV. The reason for this is that in Example 9.9 we 
are interested in (for $\pi_2 = .1$) the probability that the number of observations in Category 
2 (females) is equal to or greater than 3 (which equals the probability that the number of 
observations in Category 1 (males) is equal to or less than 7 for $\pi_1 = .9$). The protocol 
presented in the last paragraph of Section IV in reference to the value $\pi_2 = .3$ describes the 
use of Table A7 to determine the probability that the number of observations in Category 
2 is equal to or less than 1 (which equals the probability that the number of observations 
in Category 1 is equal to or greater than 9 for $\pi_1 = .7$). Note that in Example 9.9 a more 
extreme score is defined as one that is larger than the lower of the two observed frequencies 
or smaller than the larger of the two observed frequencies. On the other hand, in the 
example in the last paragraph of Section IV a more extreme score is defined as one that is 
smaller than the lower of the two observed frequencies or larger than the higher of the two 
observed frequencies. The criterion for defining what constitutes an extreme score is directly 
related to the alternative hypothesis the researcher employs. If the alternative hypothesis 
is nondirectional, an extreme score can fall either above or below an observed frequency, 
whereas if a directional alternative hypothesis is employed, a more extreme score can only 
be in the direction indicated by the alternative hypothesis. 


17. Endnote 5 in the Introduction states that the value e is the base of the natural system of 
logarithms. e, which equals 2.71828..., is an irrational number (i.e., a number that has 
a decimal notation that goes on forever without a repeating pattern of digits). 


Test 10 


The Single-Sample Runs Test 
(and Other Tests of Randomness) 


(Nonparametric Test Employed with Categorical/Nominal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Is the distribution of a series of binary events in a population 
random? 


Relevant background information on test By definition a random series is one for which no 
algorithm (i.e., set of rules) can be generated that will allow one to predict at above chance which 
of the k possible alternatives will occur on a given trial.¹ The single-sample runs test is one of 
a number of statistical procedures that have been developed for evaluating whether or not the 
distribution of a series of N numbers is random. The test evaluates the number of runs in a 
series in which, on each trial, the outcome must be one of k = 2 alternatives. Within the series, 
one of the alternatives occurs on $n_1$ trials and the other alternative occurs on $n_2$ trials. Thus, 
$n_1 + n_2 = N$. A run is a sequence within a series in which one of the k alternatives occurs on 
consecutive trials. On the trial prior to the first trial of a run (with the exception of Trial 1 in the 
series) and the trial following the last trial of a run (with the exception of the $N^{th}$ trial in the 
series), the alternative that occurs will be different than the alternative that occurs during each 
of the trials of the run. The minimum length of a run is one trial, and the maximum length of a 
run is equal to N, the total number of trials in the series. To illustrate the computation of the 
length of a run, consider the three series noted in Figure 10.1. Each series is comprised of N = 
10 trials. On each trial a coin is flipped and the outcome of Heads (H) or Tails (T) is recorded. 


Trial 1 2 3 4 5 6 7 8 9 10 
Series A: H H T H H T T T H T 
Series B: T H T H T H T H T H 
Series C: H H H H H H H H H H 


Figure 10.1 Illustration of Runs 


In Series A and Series B there are $n_1 = 5$ Heads and $n_2 = 5$ Tails. In Series C there are 
$n_1 = 10$ Heads and $n_2 = 0$ Tails. 

In Series A there are six runs. Run 1 consists of Trials 1 and 2 (which are Heads). Run 2 
consists of Trial 3 (which is Tails). Run 3 consists of Trials 4 and 5 (which are Heads). Run 4 
consists of Trials 6-8 (which are Tails). Run 5 consists of Trial 9 (which is Heads). Run 6 
consists of Trial 10 (which is Tails). This can be summarized visually by separating the 
runs as noted below. Note that all the runs are comprised of sequences involving the same 
alternative. Thus: HH  T  HH  TTT  H  T. 
In Series B there are 10 runs. Each of the trials constitutes a separate run, since on each trial 
a different alternative occurs. Note that on each trial the alternative for that trial is preceded by 
and followed by a different alternative. In accordance with the 
definition of a run, Trial 1 cannot be preceded by a different alternative, since it is the first trial, 
and Trial 10 cannot be followed by a different alternative, since it is the last trial. 

In Series C there is one run. This is the case, since the same alternative occurs on each 
trial. Thus: H H H H H H H H H H. 

Intuitively, one would expect that of the three series, Series A is most likely to conform to 
the definition of a random series. This is the case, since it is highly unlikely that a random series 
will exhibit a discernible pattern that will allow one to predict at above chance which of the 
alternatives will appear on a given trial. Series B and C, on the other hand, are characterized by 
patterns that will probably bias the guess of someone who is attempting to predict what the 
outcome will be if there is an eleventh trial. It is logical to expect that the strength of such a bias 
will be a direct function of the length of any series exhibiting a consistent pattern.² 

The test statistic for the single-sample runs test is based on the assumption that the number 
of runs in a random series will be expected to fall within a certain range of values. Thus, if for 
a given series the number of runs is less than some minimum value or greater than some 
maximum value, it is likely that the series is not random. The determination of the minimum 
allowable number of runs and maximum allowable number of runs in a series of N trials takes 
into account the number of runs, as well as the frequency of occurrence of each of the two 
alternatives within the series. 

It should be noted that although the single-sample runs test is most commonly employed 
with a binomially distributed variable for which $\pi_1 = \pi_2 = .5$ (as is the case for a coin toss), 
it is not required that the values $\pi_1$ and $\pi_2$ equal .5 in the underlying population. It is important 
to remember that the runs test does not evaluate a hypothesis regarding the values of $\pi_1$ and $\pi_2$ 
in the underlying population, nor does it make any assumption with regard to the latter values. 
The test statistic for the runs test is a function of the proportion of times each of the alternatives 
occurs in the sample data/series (i.e., $p_1 = n_1/N$ and $p_2 = n_2/N$). If the observed proportion 
for each of the alternatives is inconsistent with its actual proportion in the underlying population, 
the single-sample runs test is not designed to detect such a difference. Example 10.7 in Section 
VIII illustrates a situation in which the single-sample runs test is employed when it is known 
that $\pi_1 \neq \pi_2 \neq .5$. 


II. Example 

Example 10.1 In a test of extrasensory ability, a coin is flipped 20 times by an experimenter. 
Prior to each flip of the coin the subject is required to guess whether it will come up Heads or 
Tails. After each trial the subject is told whether his guess is correct or incorrect. The actual 
outcomes for the 20 coin flips are listed below: 


HHHTTTHHTTHTHTHTTTHH 


To rule out the possibility that the subject gained extraneous information as a result of a 
nonrandom pattern in the above series, the experimenter decides to evaluate it with respect to 
randomness. Does an analysis of the series suggest that it is nonrandom? 


III. Null versus Alternative Hypotheses 


Null hypothesis $H_0$: The events in the underlying population represented by the sample series 
are distributed randomly. 

Alternative hypothesis $H_1$: The events in the underlying population represented by the sample 
series are distributed nonrandomly. (This is a nondirectional alternative hypothesis and it is 
evaluated with a two-tailed test.) 


or 


$H_1$: The events in the underlying population represented by the sample series are distributed 
nonrandomly due to too few runs. (This is a directional alternative hypothesis and it is evalu- 
ated with a one-tailed test.) 


or 


$H_1$: The events in the underlying population represented by the sample series are distributed 
nonrandomly due to too many runs. (This is a directional alternative hypothesis and it is 
evaluated with a one-tailed test.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 


In order to compute the test statistic for the single-sample runs test, one must determine the 
number of times each of the two alternatives appears in the series and the number of runs in the 
series. Thus, we determine that the series described in Example 10.1 is comprised of $n_1 = 10$ 
Heads and $n_2 = 10$ Tails. Note that $n_1 + n_2 = N = 20$. We also determine that there are 
r = 11 runs, which represents the test statistic for the single-sample runs test. Specifically, as 
one moves from Trial 1 to Trial 20: H H H (Run 1); T T T (Run 2); H H (Run 3); T T (Run 
4); H (Run 5); T (Run 6); H (Run 7); T (Run 8); H (Run 9); T T T (Run 10); and H H (Run 
11). This can also be represented visually by separating each of the runs. Thus: 


HHH  TTT  HH  TT  H  T  H  T  H  TTT  HH 
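
The counting described above can be sketched in a few lines. The function name `summarize_runs` is illustrative only; `itertools.groupby` groups consecutive identical outcomes, which corresponds to the definition of a run given in Section I.

```python
from itertools import groupby


def summarize_runs(series):
    """Return (frequency of each alternative, list of runs) for a series."""
    # Each run is recorded as (alternative, length of the run).
    runs = [(symbol, len(list(group))) for symbol, group in groupby(series)]
    counts = {}
    for symbol, length in runs:
        counts[symbol] = counts.get(symbol, 0) + length
    return counts, runs


series = "HHHTTTHHTTHTHTHTTTHH"   # the 20 coin flips of Example 10.1
counts, runs = summarize_runs(series)
r = len(runs)                      # the test statistic

print(counts)   # frequency of Heads and Tails
print(r)        # number of runs
```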


V. Interpretation of the Test Results 


The computed value r = 11 is interpreted by employing Table A8 (Table of Critical Values for 
the Single-Sample Runs Test) in the Appendix. The critical values listed in Table A8 only 
allow the null hypothesis to be evaluated at the .05 level if a two-tailed/nondirectional alternative 
hypothesis is employed, and at the .025 level if a one-tailed/directional alternative hypothesis is 
employed. No critical values are recorded in Table A8 for the single-sample runs test for very 
small sample sizes, since the levels of significance employed in the table cannot be achieved for 
sample sizes below a specific minimum value. More extensive tables for the single-sample runs 
test which provide critical values for other levels of significance can be found in Swed and 
Eisenhart (1943) and Beyer (1968).³ 

Note that in Table A8 the critical r values are listed in reference to the values of $n_1$ and $n_2$, 
which represent the frequencies that each of the alternatives occurs in the series. Since in 
Example 10.1, $n_1 = 10$ and $n_2 = 10$, we locate the cell in Table A8 that is the intersection of 
these two values. In the appropriate cell, the upper value identifies the lower limit for the value 
of r, whereas the lower value identifies the upper limit for the value of r. The following 
guidelines are employed in reference to the latter values. 

a) If the nondirectional alternative hypothesis is employed, to reject the null hypothesis, the 
obtained value of r must be equal to or greater than the tabled critical upper limit at the 
prespecified level of significance, or be equal to or less than the tabled critical lower limit at the 
prespecified level of significance. 

b) If the directional alternative hypothesis predicting too few runs is employed, to reject 
the null hypothesis, the obtained value of r must be equal to or less than the tabled critical lower 
limit at the prespecified level of significance. 

c) If the directional alternative hypothesis predicting too many runs is employed, to reject 
the null hypothesis, the obtained value of r must be equal to or greater than the tabled critical 
upper limit at the prespecified level of significance. 

Employing Table A8, we determine that for $n_1 = n_2 = 10$, the tabled critical lower and 
upper r values are r = 6 and r = 16. Thus, if the nondirectional alternative hypothesis is 
employed (with $\alpha = .05$), the obtained value of r will be significant if it is equal to or less than 
6 or equal to or greater than 16. In other words, it will be significant if there are either 16 or 
more runs or 6 or fewer runs in the data. Since r = 11 falls between these limits, the nondirectional 
alternative hypothesis is not supported. 

If the directional alternative hypothesis predicting too few runs is employed (with $\alpha = .025$), 
the obtained value of r will only be significant if it is equal to or less than 6. In other words, it 
will only be significant if there are 6 or fewer runs in the data. Since r = 11 is greater than 6, the 
directional alternative hypothesis predicting too few runs is not supported. 

If the directional alternative hypothesis predicting too many runs is employed (with 
$\alpha = .025$), the obtained value of r will only be significant if it is equal to or greater than 16. In 
other words, it will only be significant if there are 16 or more runs in the data. Since r = 11 is 
less than 16, the directional alternative hypothesis predicting too many runs is not supported. 

Our analysis indicates that regardless of which alternative hypothesis one employs, the null 
hypothesis cannot be rejected. Thus, the data do not allow the researcher to conclude that the 
series is nonrandom. 
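
For readers without access to Table A8, the critical limits used above can be approximated from the exact sampling distribution of the number of runs. The sketch below uses the standard combinatorial result for two categories (all arrangements of the $n_1$ and $n_2$ outcomes treated as equally likely); that formula does not appear in this section and is an assumption here, as are the function names and the rule of allotting .025 to each tail. Tabled values for other sample sizes may follow slightly different conventions.

```python
from math import comb


def runs_pmf(r, n1, n2):
    """Exact probability of r runs when every arrangement is equally likely."""
    if n1 < 1 or n2 < 1 or r < 2 or r > n1 + n2:
        return 0.0
    k, is_odd = divmod(r, 2)
    if is_odd:   # r = 2k + 1
        ways = (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
                + comb(n1 - 1, k - 1) * comb(n2 - 1, k))
    else:        # r = 2k
        ways = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    return ways / comb(n1 + n2, n1)


def critical_limits(n1, n2, alpha_per_tail=0.025):
    """Largest lower limit and smallest upper limit with tail area <= alpha."""
    values = list(range(2, n1 + n2 + 1))
    probs = [runs_pmf(r, n1, n2) for r in values]
    lower = upper = None
    cumulative = 0.0
    for r, p in zip(values, probs):                       # lower tail
        cumulative += p
        if p > 0 and cumulative <= alpha_per_tail:
            lower = r
    cumulative = 0.0
    for r, p in zip(reversed(values), reversed(probs)):   # upper tail
        cumulative += p
        if p > 0 and cumulative <= alpha_per_tail:
            upper = r
    return lower, upper


limits = critical_limits(10, 10)
print(limits)
```

For $n_1 = n_2 = 10$ the sketch returns the same limits (6 and 16) that are read from Table A8 in the text.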


VI. Additional Analytical Procedures for the Single-Sample Runs 
Test and/or Related Tests 


1. The normal approximation of the single-sample runs test for large sample sizes The 
normal distribution can be employed with a large sample size/series to approximate the exact 
distribution of the single-sample runs test. The large sample approximation is generally 
employed for sample sizes larger than those documented in Table A8. Equation 10.1 is 
employed for the normal approximation of the single-sample runs test. 


$$z = \frac{r - \left[\frac{2n_1n_2}{n_1 + n_2} + 1\right]}{\sqrt{\frac{2n_1n_2(2n_1n_2 - n_1 - n_2)}{(n_1 + n_2)^2(n_1 + n_2 - 1)}}}$$ (Equation 10.1) 


In the numerator of the above equation the term $[(2n_1n_2)/(n_1 + n_2)] + 1$ represents the 
mean of the sampling distribution of runs in a random series in which there are N observations. 
The latter value may be summarized with the notation $\mu_r$. In other words, given $n_1 = 10$ and 
$n_2 = 10$, if in fact the distribution is random, the best estimate of the number of runs one can 
expect to observe is $\mu_r = 11$. The denominator in Equation 10.1 represents the expected 
standard deviation of the sampling distribution for the normal approximation of the test statistic. 
The latter value is summarized by the notation $\sigma_r$. 

Employing Equation 10.1 with the data for Example 10.1, the value z = 0 is computed. 

$$z = \frac{11 - \left[\frac{(2)(10)(10)}{10 + 10} + 1\right]}{\sqrt{\frac{(2)(10)(10)[(2)(10)(10) - 10 - 10]}{(10 + 10)^2(10 + 10 - 1)}}} = \frac{0}{2.18} = 0$$

Since $\mu_r = 11$ and $\sigma_r = 2.18$, the result of the above analysis can also be summarized as 
follows: z = (11 - 11)/2.18 = 0. 
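
The computation with Equation 10.1 can be sketched as a small function. The function name and return layout are illustrative only.

```python
from math import sqrt


def runs_normal_approximation(r, n1, n2):
    """Return (mu_r, sigma_r, z) for the normal approximation of the runs test."""
    n = n1 + n2
    mu_r = (2 * n1 * n2) / n + 1                 # expected number of runs
    sigma_r = sqrt((2 * n1 * n2 * (2 * n1 * n2 - n1 - n2))
                   / (n**2 * (n - 1)))           # expected standard deviation
    z = (r - mu_r) / sigma_r
    return mu_r, sigma_r, z


# Example 10.1: r = 11 runs, 10 Heads and 10 Tails.
mu_r, sigma_r, z = runs_normal_approximation(11, 10, 10)
print(mu_r, round(sigma_r, 2), z)
```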

The obtained value z = 0 is evaluated with Table A1 (Table of the Normal Distribution) 
in the Appendix. To be significant, the obtained absolute value of z must be equal to or greater 
than the tabled critical value at the prespecified level of significance. The tabled critical 
two-tailed .05 and .01 values are $z_{.05} = 1.96$ and $z_{.01} = 2.58$, and the tabled critical one-tailed .05 
and .01 values are $z_{.05} = 1.65$ and $z_{.01} = 2.33$. The following guidelines are employed in 
evaluating the null hypothesis. 

a) If the nondirectional alternative hypothesis is employed, to reject the null hypothesis, the 
obtained absolute value of z must be equal to or greater than the tabled critical two-tailed value 
at the prespecified level of significance. In Example 10.1 the nondirectional alternative hypoth- 
esis is not supported, since the obtained value z = 0 is less than both of the aforementioned tabled 
critical two-tailed values. 

b) If the directional alternative hypothesis predicting too few runs is employed, to reject 
the null hypothesis, the following must be true: 1) The obtained value of z must be a negative 
number; and 2) The absolute value of z must be equal to or greater than the tabled critical 
one-tailed value at the prespecified level of significance. In Example 10.1 the directional 
alternative hypothesis predicting too few runs is not supported, since the obtained value z = 0 is 
not a negative number (as well as the fact that it is less than the tabled critical one-tailed values 
$z_{.05} = 1.65$ and $z_{.01} = 2.33$). 

c) If the directional alternative hypothesis predicting too many runs is employed, to reject 
the null hypothesis, the following must be true: 1) The obtained value of z must be a positive 
number; and 2) The absolute value of z must be equal to or greater than the tabled critical 
one-tailed value at the prespecified level of significance. In Example 10.1 the directional 
alternative hypothesis predicting too many runs is not supported, since the obtained value z = 0 
is not a positive number (as well as the fact that it is less than the tabled critical one-tailed values 
$z_{.05} = 1.65$ and $z_{.01} = 2.33$). 

Thus, when the normal approximation is employed, as is the case when the critical values 
in Table A8 are used, the null hypothesis cannot be rejected regardless of which alternative 
hypothesis is employed. Consequently, we cannot conclude that the series is not random. 


2. The correction for continuity for the normal approximation of the single-sample runs 
test Although it is not described by most sources, Siegel and Castellan (1988) recommend that 
a correction for continuity be employed for the normal approximation of the single-sample runs 
test. Equation 10.2, which is the continuity-corrected equation, will always yield a smaller 
absolute z value than the value derived with Equation 10.1.⁴ 

$$z = \frac{|r - \mu_r| - .5}{\sigma_r}$$ (Equation 10.2) 


Employing Equation 10.2 with the data for Example 10.1, the value z = -.23 is computed. 

$$z = \frac{|11 - 11| - .5}{2.18} = -.23$$


Since the absolute value z = .23 is lower than the tabled critical two-tailed value 
$z_{.05} = 1.96$ and the tabled critical one-tailed value $z_{.05} = 1.65$, the null hypothesis cannot be 
rejected, regardless of which alternative hypothesis is employed (which is also the case when the 
correction for continuity is not employed). Thus, we cannot conclude that the series is not 
random. 
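
A minimal sketch of the continuity-corrected computation follows. The function name is hypothetical, and the value $\sigma_r = 2.18$ is the rounded figure used in the text.

```python
def continuity_corrected_z(r, mu_r, sigma_r):
    """Equation 10.2: subtract .5 from |r - mu_r| before dividing by sigma_r."""
    # When |r - mu_r| is smaller than .5 the result is negative,
    # which is what happens with the data of Example 10.1.
    return (abs(r - mu_r) - 0.5) / sigma_r


# Example 10.1: r = 11, mu_r = 11, sigma_r = 2.18.
z_corrected = continuity_corrected_z(11, 11, 2.18)
print(round(z_corrected, 2))
```

Because the numerator works with the absolute difference, the function does not by itself carry the direction of the departure from $\mu_r$; the sign of r minus $\mu_r$ has to be inspected separately when a directional alternative hypothesis is evaluated.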


3. Extension of the runs test to data with more than two categories Wallis and Roberts 
(1956) and Zar (1999) note that Equations 10.3 and 10.4 can be employed for the single-sample 
runs test when the data fall into more than two categories. Equation 10.3 computes the mean 
of the sampling distribution (i.e., $\mu_r$, the expected number of runs), and Equation 10.4 computes 
the expected standard deviation of the sampling distribution ($\sigma_r$). When there are two categories, 
Equation 10.3 is equivalent to the term on the right side of the numerator of Equation 10.1 (i.e., 
$\mu_r$), and Equation 10.4 is equivalent to the denominator of Equation 10.1 (i.e., $\sigma_r$). The values 
computed for $\mu_r$ and $\sigma_r$ are substituted in the normal approximation equation for the 
single-sample runs test, i.e., Equation 10.1 (the continuity-corrected version, Equation 10.2, may also 
be employed). The computed value of z is interpreted in the same manner as when there are two 
categories. 


$$\mu_r = \frac{N(N + 1) - \sum n_i^2}{N}$$ (Equation 10.3) 

$$\sigma_r = \sqrt{\frac{\sum n_i^2\left[\sum n_i^2 + N(N + 1)\right] - 2N\sum n_i^3 - N^3}{N^2(N - 1)}}$$ (Equation 10.4) 

With respect to the notation employed in Equations 10.3 and 10.4, note that there will 
be k categories with the following number of trials/observations for each category: $n_1, n_2, n_3, 
\ldots, n_k$. Since $n_i$ represents the number of trials/observations for the $i^{th}$ category, $N = \sum n_i$. The 
notations $\sum n_i^2$ and $\sum n_i^3$, respectively, indicate that the number of observations in each category 
is squared and cubed, and the latter values are summed. 

To illustrate the single-sample runs test when there are more than two categories, assume 
we have a three-sided die that on each trial can come up as face value A, B, or C. Assume the 
pattern of results noted below is obtained. 


Each of the runs (which are comprised of one or more consecutive identical outcomes) is 
underlined, yielding a total of 12 runs. If we let $n_1$ represent the number of trials in which A 
appears, $n_2$ represent the number of trials in which B appears, and $n_3$ represent the number of 
trials in which C appears, then $n_1 = 7$, $n_2 = 8$, $n_3 = 5$, and N = 20. We can compute that 
$\sum n_i^2 = 7^2 + 8^2 + 5^2 = 138$ and $\sum n_i^3 = 7^3 + 8^3 + 5^3 = 980$. Substituting the appropriate 
values in Equations 10.3, 10.4, and 10.1, we compute the value z = -1.06. 


$$\mu_r = \frac{20(20 + 1) - 138}{20} = 14.1$$

$$\sigma_r = \sqrt{\frac{138[138 + 20(20 + 1)] - (2)(20)(980) - (20)^3}{(20)^2(20 - 1)}} = 1.98$$

$$z = \frac{12 - 14.1}{1.98} = -1.06$$


The correction for continuity can be employed by subtracting .5 from the absolute value 
in the numerator of Equation 10.1. When the correction for continuity is employed, 
(|12 - 14.1| - .5)/1.98 = .81 is obtained as the absolute value of z (i.e., z = -.81, since r is less 
than $\mu_r$). Since the absolute values z = 1.06 and z = .81 are lower than 
the tabled critical two-tailed value $z_{.05} = 1.96$ and the tabled critical one-tailed value 
$z_{.05} = 1.65$, the null hypothesis cannot be rejected, regardless of which alternative hypothesis 
is employed. Thus, we cannot conclude that the series is not random. The negative sign for z 
just indicates that the observed number of runs was less than the expected value. 
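
Equations 10.3 and 10.4 can be sketched as follows, using only the category frequencies and the number of runs reported in the text (7, 8, 5 and r = 12). The function name is illustrative only.

```python
from math import sqrt


def multicategory_runs_z(r, counts):
    """Return (mu_r, sigma_r, z) for a runs test with two or more categories."""
    n = sum(counts)
    sum_squares = sum(c**2 for c in counts)   # sum of squared frequencies
    sum_cubes = sum(c**3 for c in counts)     # sum of cubed frequencies
    mu_r = (n * (n + 1) - sum_squares) / n
    variance = ((sum_squares * (sum_squares + n * (n + 1))
                 - 2 * n * sum_cubes - n**3)
                / (n**2 * (n - 1)))
    sigma_r = sqrt(variance)
    return mu_r, sigma_r, (r - mu_r) / sigma_r


# Three-sided die illustration: 12 runs, frequencies 7, 8, and 5.
mu_r, sigma_r, z = multicategory_runs_z(12, [7, 8, 5])
print(round(mu_r, 1), round(sigma_r, 2), round(z, 2))
```

Calling the same function with two categories, e.g. `multicategory_runs_z(11, [10, 10])`, gives the mean and standard deviation obtained with Equation 10.1 for Example 10.1, which is a convenient check on the stated equivalence.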

The reader should note that Table A8 cannot be employed to evaluate the results of the 
analysis described in this section, since the latter table is only designed for use with two 
categories. Zar (1999) notes that O'Brien (1976) and O'Brien and Dyck (1985) have developed 
a more powerful version of the runs test that can be employed when there are more than two 
categories. Other tests of randomness that can be employed when there are more than two 
categories are discussed in Section IX (the Addendum). 


4. Test 10a: The runs test for serial randomness A variant of the single-sample runs test 
described in this section is the runs test for serial randomness (also referred to as the up-down 
runs test). The use of the term serial refers to the analysis of the sequence of events in a series. 
Attributed to Wallis and Moore (1941), the runs test for serial randomness (which is also 
discussed in Schmidt and Taylor (1970) and Zar (1999)) is employed when the data being 
evaluated are in a quantitative rather than a categorical format. In such a case a researcher might 
want to determine if the shifts in the direction (i.e., up or down) of a sequence of scores is in a 
random order. Within the framework of this test, each shift in direction represents the beginning 
of a new run. The total number of runs is the total number of directional shifts in a set of data. 
To illustrate, consider the following set of ten scores: 


$2^{+},\ 3^{-},\ 1^{+},\ 6^{+},\ 7^{-},\ 4^{-},\ 3^{-},\ 2^{-},\ 1^{+},\ 7$ 


Note that a plus sign (+) is recorded at the upper right of a score if the score that follows 
it is larger, and a minus sign (−) is recorded if the score that follows it is smaller. Runs are 
determined as they were with the single-sample runs test. Thus, each string of plus signs or minus 
signs constitutes a run. In the above example there are five runs. The first two runs are 
comprised of a plus and a minus sign (+ −), followed by a run comprised of two plus signs (+ +), 
followed by a run comprised of four minus signs (− − − −), followed by a run comprised of a plus 
sign (+). Note that the total number of plus and minus signs is one less than the total number of 
scores. 
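
The sign-recording procedure can be sketched as shown below. The function name is hypothetical, and raising an error for tied adjacent scores is an assumption of this sketch, since the passage above does not say how ties are handled.

```python
from itertools import groupby


def up_down_runs(scores):
    """Return (signs, runs) for the direction of change between adjacent scores."""
    signs = []
    for current, following in zip(scores, scores[1:]):
        if following == current:
            # Assumption: ties are not covered by the description above.
            raise ValueError("tied adjacent scores are not handled in this sketch")
        signs.append("+" if following > current else "-")
    # Each string of identical consecutive signs constitutes one run.
    runs = ["".join(group) for _, group in groupby(signs)]
    return signs, runs


scores = [2, 3, 1, 6, 7, 4, 3, 2, 1, 7]
signs, runs = up_down_runs(scores)
print("".join(signs))   # one sign fewer than the number of scores
print(runs, len(runs))
```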

The null hypothesis evaluated by the runs test for serial randomness is that in a set of data 
the distribution of successive changes in direction (i.e., runs) is random. The nondirectional 
alternative hypothesis is that in a set of data the distribution of successive changes in direction 
(runs) is not random. The alternative hypothesis can also be stated directionally. Specifically, 
one can predict a nonrandom pattern involving an excessive number of shifts in direction 
(resulting in a higher than expected number of runs), or a nonrandom pattern involving very few 
shifts in direction (resulting in a lower than expected number of runs). 

Zar (1999) has prepared a table of exact probabilities for the runs test for serial randomness
when N ≤ 50. However, for large sample sizes (generally 50 or more trials/observations),
Equation 10.7 can be employed to evaluate the results of the test. Note that although the latter
equation has the same structure as the normal approximation equation for the single-sample runs
test, in the case of the runs test for serial randomness Equations 10.5 and 10.6 are employed
to compute the values of the expected number of runs (μ_r) and the expected standard deviation
(σ_r). The computed value of z is interpreted in the same manner as it is for the single-sample
runs test. As is the case with the latter test, it will require either a very large or very small 
number of runs to reject the null hypothesis, and thus conclude that the data indicate a lack of 
randomness. 








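\[ \mu_r = \frac{2N - 1}{3} \qquad \text{(Equation 10.5)} \]

\[ \sigma_r = \sqrt{\frac{16N - 29}{90}} \qquad \text{(Equation 10.6)} \]

\[ z = \frac{r - \mu_r}{\sigma_r} \qquad \text{(Equation 10.7)} \]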


Example 10.2 will be employed to illustrate the runs test for serial randomness. Although 
the sample size in Example 10.2 is less than the value recommended for the normal approxi- 
mation, it will be employed to demonstrate the test. Since it provides for a more conservative 
test, one has the option of employing the correction for continuity (through use of Equation 10.2) 
to lower the likelihood of committing a Type I error. 


Example 10.2 A quality control study is conducted on a machine that pours milk into 
containers. The amount of milk (in liters) dispensed by the machine into 21 consecutive 
containers follows: 1.90, 1.99, 2.00, 1.78, 1.77, 1.76, 1.98, 1.90, 1.65, 1.76, 2.01, 1.78, 1.99, 
1.76, 1.94, 1.78, 1.67, 1.87, 1.91, 1.91, 1.89. Are the successive increments and decrements in
the amount of milk dispensed random? 


The sequence of up-down shifts is summarized below. 


+ + - - - + - - + + - + - + - - + + 0 -

The following 20 symbols are recorded above (one less than the total number of
observations): 9 pluses, indicating an increase from one measurement to the next; 10 minuses,
indicating a decrease from one measurement to the next; and one zero indicating no change (for
the two values of 1.91). When one or more zeroes are present in the data, the number of runs is
determined twice: once with the zero counted as a plus, and once with the zero counted as a minus. If the zero is
counted as a plus, the total number of runs will equal 12. This is the case since prior to the zero 
there are 11 runs. If the zero is counted as a plus it extends the 11th run (which will now consist 
of three pluses instead of two pluses), and the last minus constitutes the 12th run. If the zero is 
counted as a minus there will still be 12 runs, since there are 11 runs up to the zero, and if the zero 
is viewed as a minus it joins with the last minus to comprise the 12th run, which will now consist 
of two minuses. In some cases if a zero is present, a different total will be obtained for the 
number of runs, depending upon whether the zero is viewed as a plus or minus. When the latter 
is true, a test statistic is obtained for each run value, and a decision is made based on both values. 
If more than one zero is present in the data the analysis can get quite tedious, since one has to
consider all possible combinations of counting any zero as a plus or minus. In such a case, it 
would probably be advisable to employ the single-sample runs test (see Example 10.5) or some 
alternative test of randomness (some of which are discussed in Section IX (the Addendum)). 
The data for Example 10.2 will now be evaluated employing Equations 10.5-10.7. For our
example, N = 21, which represents the total number of observations, and r = 12, which represents
the number of runs.


\[ \mu_r = \frac{2(21) - 1}{3} = 13.67 \]

\[ \sigma_r = \sqrt{\frac{16(21) - 29}{90}} = 1.85 \]

\[ z = \frac{12 - 13.67}{1.85} = -.90 \]


Since the absolute value z = .90 is lower than the tabled critical two-tailed value
z_.05 = 1.96 and the tabled critical one-tailed value z_.05 = 1.65, the null hypothesis cannot be
rejected, regardless of which alternative hypothesis is employed. The observed value r = 12 is 
well within chance expectation for the number of runs expected in a random distribution. The 
negative sign for z just indicates that the observed number of runs was less than the expected 
value. 
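
The analysis of Example 10.2 can be retraced with the following Python sketch. The function names are illustrative assumptions; the formulas are those of Equations 10.5-10.7, and each zero is tried both as a plus and as a minus, as described above. It is offered as a check on the arithmetic rather than as a definitive implementation.

```python
import math
from itertools import product


def up_down_signs(values):
    """Sign of each successive change: '+', '-' or '0' (no change)."""
    signs = []
    for current, following in zip(values, values[1:]):
        if following > current:
            signs.append("+")
        elif following < current:
            signs.append("-")
        else:
            signs.append("0")
    return signs


def count_runs(symbols):
    """Count the maximal strings of identical adjacent symbols."""
    runs = 0
    previous = None
    for symbol in symbols:
        if symbol != previous:
            runs += 1
        previous = symbol
    return runs


def possible_run_counts(signs):
    """Run totals obtained when every '0' is counted as '+' and as '-'."""
    zero_positions = [i for i, s in enumerate(signs) if s == "0"]
    counts = set()
    for choice in product("+-", repeat=len(zero_positions)):
        filled = list(signs)
        for position, sign in zip(zero_positions, choice):
            filled[position] = sign
        counts.add(count_runs(filled))
    return sorted(counts)


def serial_randomness_z(n_obs, runs):
    """Equations 10.5-10.7: expected runs, standard deviation, and z."""
    mu = (2 * n_obs - 1) / 3
    sigma = math.sqrt((16 * n_obs - 29) / 90)
    return mu, sigma, (runs - mu) / sigma


milk = [1.90, 1.99, 2.00, 1.78, 1.77, 1.76, 1.98, 1.90, 1.65, 1.76, 2.01,
        1.78, 1.99, 1.76, 1.94, 1.78, 1.67, 1.87, 1.91, 1.91, 1.89]
signs = up_down_signs(milk)
for r in possible_run_counts(signs):
    mu, sigma, z = serial_randomness_z(len(milk), r)
    print(r, round(mu, 2), round(sigma, 2), round(z, 2))  # 12 13.67 1.85 -0.9
```

Only one line is printed because the single zero yields r = 12 whether it is counted as a plus or as a minus.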

The value r = 12 also does not achieve significance if one employs Zar’s (1999) table of
critical values. In the latter table, for n = 21, the critical two-tailed .05 values are respectively
9 and 18, and the critical one-tailed .05 values are 10 and 18. In order to be significant, the
obtained value of r must be equal to or less than the first number in each pair or equal to
or greater than the second number in each pair. Since r = 12 is in between the limits that 
define the critical values, the null hypothesis is retained. Thus, regardless of whether we employ 
Equation 10.7 or Zar’s (1999) table of critical values, we cannot conclude that the series for 
dispensing milk is not random. 

It should be noted that the same set of data employed for Example 10.2 is evaluated with 
the single-sample runs test (see Example 10.5), also yielding a nonsignificant result. However, 
as is the case with the single-sample runs test, the runs test for serial randomness has the 
limitation that it can yield a nonsignificant result with data that is clearly nonrandom. Schmidt 
and Taylor (1970) provide an example in which the values of all the observations in the first half 
of a series fall below the median, while the values of all the observations in the second half of 
the series fall above the median value (a pattern that would not be expected in a random series). 
Yet in spite of the latter, the number of runs in the series falls within chance expectation if the 
data are analyzed with the runs test for serial randomness. As a general rule, runs tests are not 
the most stringent tests with respect to evaluating a hypothesis regarding randomness. 

A discussion of the power of the runs test for serial randomness can be found in Levene 
(1952). Additional tests of randomness involving the analysis of runs can be found in Banks 
and Carson (1984), Phillips et al. (1976), and Schmidt and Taylor (1970). The latter sources 
describe tests that evaluate the observed versus expected frequency distribution of the length of
runs for series evaluated with the runs test for serial randomness, as well as the single-sample 
runs test. 

VII. Additional Discussion of the Single-Sample Runs Test


1. Additional discussion of the concept of randomness It is important to note that a 
distinction is made between a random and pseudorandom series of numbers. It is assumed that
if a series is random, there is no algorithm that will allow one to predict at above chance which
of the possible outcomes will occur on a given trial. Truly random processes are event sequences
that occur within a natural context (i.e., real world phenomena such as the radioactive decay of
atomic nuclei and Brownian molecular motion). A pseudorandom series, however, is generated
through use of a computer program that employs a deterministic algorithm. As a result of this, 
if one is privy to the rule stated by the algorithm, one will be able to correctly predict all of the 
numbers in the pseudorandom series in the order in which they are generated. Pseudorandom 
series are often employed to simulate naturally occurring random events, since their use in the 
latter context provides researchers with a mechanism for studying phenomena that otherwise 
would be impossible or problematical to evaluate. Research employing pseudorandom number 
series is commonly referred to as Monte Carlo research. Peterson (1998) notes that the use of 
the term Monte Carlo grew out of the work of Stanislaw Ulam and John von Neumann, two 
brilliant mathematicians who in the late 1940s using the earliest computers conducted seminal 
research on the simulation of random processes. In addition to their use in simulation,
pseudorandom numbers are commonly employed today to generate outcomes in slot machines, as well
as to present random stimuli in computer software (such as video games). More recently,
the popularity of data-driven statistical methods (discussed in Section IX (the Addendum) of the
Mann-Whitney U test (Test 12)) has increased the demand for reliable random number
generators.

To employ pseudorandom numbers effectively within the framework of simulation, it is 
essential to demonstrate that any mathematically generated series is, in fact, random. Yet the 
latter is easier said than done. In Seminumerical Algorithms (1969, 1981, 1997), the second
volume of his classic work The Art of Computer Programming, Donald Knuth notes that a number of mathematicians have
suggested as a definition of randomness that a series of numbers should be able to pass each of 
the statistical tests that have been developed for evaluating randomness. Yet Knuth (1969, 1981, 
1997) notes that it is virtually impossible to identify or generate a random series that will pass 
each and every statistical test that has been developed. Even if one could find such a series, it 
is all but certain that within the series there will be one or more sequences of numbers (often 
quite long in duration) which by themselves will fail one or more of the statistical tests for 
randomness. Peterson (1998) provides an interesting discussion on the limitations of employing 
pseudorandom numbers to simulate naturally occurring random processes. One of the examples 
he cites involves what is considered to be an excellent random number generator developed by 
Marsaglia and Zaman (1994), which is able to pass the most demanding tests of randomness. 
However, the Marsaglia-Zaman random number generator (i.e., an algorithm that generates a 
random sequence) yielded incorrect results in a computer-simulated study of magnetism. Thus, 
even an excellent random number generator may be characterized by peculiarities which may 
compromise its usefulness in simulating certain natural processes. It should be emphasized, 
however, that in spite of the latter, excellent random number generators are available for 
effectively simulating virtually all naturally occurring random processes. 

The single-sample runs test is only one of many tests that have been developed for 
assessing randomness. In point of fact, many sources do not consider runs tests to be particularly 
effective mechanisms for assessing randomness (e.g., Conover (1999) states that runs tests leave
a lot to be desired as tests of randomness, since they are very low in statistical power). Section
IX (the Addendum) describes alternative tests for randomness, as well as presenting some 
algorithms for generating pseudorandom numbers. 

VIII. Additional Examples Illustrating the Single-Sample Runs Test


As is the case with Example 10.1, Examples 10.3-10.6 all involve series in which N = 20,
n_1 = n_2 = 10, and r = 11. By virtue of employing identical data, the latter examples all yield
the same result as Example 10.1. Example 10.6 illustrates the application of the single-sample
runs test to a design involving two independent samples. In Examples 10.1 and 10.3-10.5 it
is implied that if the series involved are, in fact, random, it is probably reasonable to assume in
the underlying population π_1 = π_2 = .5 (in other words, that each alternative has an equal
likelihood of occurring in the underlying population, even if the latter is not reflected in the
sample data). Example 10.7 illustrates the application of the single-sample runs test to a design
in which it is known that in the underlying population π_1 ≠ π_2 ≠ .5.


Example 10.3 A meteorologist conducts a study to determine whether humidity levels recorded 
at 12 noon for 20 consecutive days in July 1995 are distributed randomly with respect to whether 
they are above or below the average humidity recorded during the month of July during the years 
1990 through 1994. Recorded below is a listing of whether the humidity for 20 consecutive days 
is above (+) or below (—) the July average. 


+++---++--+-+-+---++

Do the data indicate that the series of humidity readings is random?


Example 10.4 The gender of 20 consecutive patients who register at the emergency room of 
a local hospital is recorded below (where: M = Male; F = Female). 


FFFMMMFFMMFMFMFMMMFF
Do the data suggest that the gender distribution of entering patients is random? 


Example 10.5 A quality control study is conducted on a machine that pours milk into con- 
tainers. The amount of milk (in liters) dispensed by the machine into 21 consecutive containers 
follows: 1.90, 1.99, 2.00, 1.78, 1.77, 1.76, 1.98, 1.90, 1.65, 1.76, 2.01, 1.78, 1.99, 1.76, 1.94,
1.78, 1.67, 1.87, 1.91, 1.91, 1.89. If the median number of liters the machine is programmed to 
dispense is 1.89, is the distribution random with respect to the amount of milk poured above 
versus below the median value? 


In Example 10.5 it can be assumed that if the process is random the scores should be 
distributed evenly throughout the series, and that there should be no obvious pattern with respect 
to scores above versus below the median. Thus, initially we list the 21 scores in sequential order 
with respect to whether they are above (+) or below (—) the median. Since one of the scores (that 
of the last container) is at the median, it is eliminated from the analysis. The latter protocol is 
employed for all scores equal to the median when the single-sample runs test is used within this 
context. The relationship of the first 20 scores to the median is recorded below. 


+++---++--+-+-+---++


Since the above sequence of runs is identical to the sequence observed for Examples 10.1, 
10.3, and 10.4, it yields the same result. Thus, there is no evidence to indicate that the 
distribution is not random. Presence of a nonrandom pattern due to a defect in the machine can
be reflected in a small number of large cycles (i.e., each cycle consists of many trials). Thus, one 
might observe 10 consecutive containers that are overfilled followed by 10 consecutive 
containers that are underfilled. A nonrandom pattern can also be revealed by an excess of runs 
attributed to multiple small cycles (i.e., each cycle consists of few trials).
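
The conversion of the Example 10.5 scores into a two-category series can be sketched as follows. The function names are illustrative assumptions; scores equal to the median are dropped, in keeping with the protocol described above.

```python
def signs_relative_to(values, reference):
    """'+' for scores above the reference value, '-' for scores below it.

    Scores equal to the reference value are eliminated from the series.
    """
    return ["+" if v > reference else "-" for v in values if v != reference]


def count_runs(symbols):
    """Count the maximal strings of identical adjacent symbols."""
    runs = 0
    previous = None
    for symbol in symbols:
        if symbol != previous:
            runs += 1
        previous = symbol
    return runs


milk = [1.90, 1.99, 2.00, 1.78, 1.77, 1.76, 1.98, 1.90, 1.65, 1.76, 2.01,
        1.78, 1.99, 1.76, 1.94, 1.78, 1.67, 1.87, 1.91, 1.91, 1.89]
signs = signs_relative_to(milk, 1.89)
print("".join(signs))                                         # +++---++--+-+-+---++
print(signs.count("+"), signs.count("-"), count_runs(signs))  # 10 10 11
```

The output reproduces the values n_1 = n_2 = 10 and r = 11 shared by Examples 10.1 and 10.3-10.5.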

The reader should take note of the fact that although the Wilcoxon signed-ranks test (Test 
4) can also be employed to evaluate the data for Example 10.5, it is not appropriate to employ 
the latter test for evaluating a hypothesis regarding randomness. The Wilcoxon signed-ranks 
test can be used to evaluate whether or not the data indicate that the true median value for the 
machine is some value other than 1.89. It does not provide information concerning the ordering 
of the data. It should also be noted that the data for Example 10.5 are identical to that employed 
for Example 10.2. Note that in the latter example the runs test for serial randomness was 
employed to evaluate the hypothesis of randomness (in reference to increments and decrements 
of liters on successive trials), and also concluded that the evidence did not suggest a lack of 
randomness. 


Example 10.6 In a study on the efficacy of an antidepressant drug, each of 20 clinically de- 
pressed patients is randomly assigned to one of two treatment groups. For 6 months one group 
is given the antidepressant drug and the other group is given a placebo. After 6 months have
elapsed, subjects in both groups are rated for depression by a panel of psychiatrists who are 
blind with respect to group membership. Each subject is rated on a 100 point scale (the higher 
the rating, the greater the level of depression). The depression ratings for the two groups follow. 


Drug group:    20, 25, 30, 48, 50, 60, 70, 80, 95, 98
Placebo group: 35, 40, 42, 52, 55, 62, 72, 85, 87, 90


Do the data indicate there is a difference between the groups? 


Since it is less powerful than alternative procedures for evaluating the same design (which 
typically contrast groups with respect to a measure of central tendency), the single-sample runs 
test is not commonly employed in evaluating a design involving two independent samples. 
Example 10.6 will, nevertheless, be used to illustrate its application to such a situation. In order 
to implement the runs test, the scores of the 20 subjects are arranged ordinally with respect to 
group membership as shown below (where D represents the Drug group and P represents the
Placebo group). 


20  25  30  35  40  42  48  50  52  55  60  62  70  72  80  85  87  90  95  98
D   D   D   P   P   P   D   D   P   P   D   P   D   P   D   P   P   P   D   D

Runs are evaluated as in previous examples. In this instance, the two categories employed 
in the series represent the two groups from which the scores are obtained. When the scores are 
arranged ordinally, if there is a difference between the groups it is expected that most of the scores 
in one group will fall to the left of the series, and that most of the scores in the other group will 
fall to the right of the series. More specifically, if the drug is effective one will predict that the 
majority of the scores in the Drug group will fall to the left of the series. Such an outcome will 
result in a small number of runs. Thus, if the number of runs is equal to or less than the tabled 
critical lower limit at the prespecified level of significance, one can conclude that the pattern of 
the data is nonrandom. Such an outcome will allow the researcher to conclude that there is a 
significant difference between the groups. Since n_1 = n_2 = 10 and r = 11, the data for
Example 10.6 are identical to those obtained for Examples 10.1, 10.3, and 10.4. Analysis of the
data does not indicate that the series is nonrandom, and thus one cannot conclude that the groups
differ from one another (i.e., represent two different populations).
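
The ordinal arrangement used for Example 10.6 can be sketched in Python as follows. The function names and the dictionary layout are illustrative assumptions; the sketch pools the two groups, orders the scores, and counts the runs of group labels. Tied scores between groups, which do not occur in these data, would require a separate decision that is not handled here.

```python
def ordered_group_labels(groups):
    """Pool the scores of all groups, sort them, and return the label series.

    `groups` maps a one-letter label to that group's list of scores.
    """
    pooled = [(score, label) for label, scores in groups.items() for score in scores]
    pooled.sort()  # ascending by score
    return "".join(label for _, label in pooled)


def count_runs(symbols):
    """Count the maximal strings of identical adjacent symbols."""
    runs = 0
    previous = None
    for symbol in symbols:
        if symbol != previous:
            runs += 1
        previous = symbol
    return runs


ratings = {
    "D": [20, 25, 30, 48, 50, 60, 70, 80, 95, 98],  # Drug group
    "P": [35, 40, 42, 52, 55, 62, 72, 85, 87, 90],  # Placebo group
}
series = ordered_group_labels(ratings)
print(series)              # DDDPPPDDPPDPDPDPPPDD
print(count_runs(series))  # 11
```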

Let us now consider two other possible patterns for Example 10.6. The first pattern 
depicted below contains r = 2 runs. It yields a significant result since in Table A8, for
n_1 = n_2 = 10, any number of runs equal to or less than 6 is significant at the .05 level. Thus,
the pattern depicted below will lead the researcher to conclude that the groups represent two 
different populations. 


DDDDDDDDDDPPPPPPPPPP 
Consider next the following pattern: 
DDDDDPPPPPPPPPPDDDDD 


Since the above pattern contains r = 3 runs, it is also significant at the .05 level. Yet 
inspection of the pattern suggests that a test which compares the mean or median values of the 
two groups will probably not result in a significant difference. This is based on the observation 
that the group receiving the drug contains the five highest and five lowest scores. Thus, if the 
performance of the Drug group is summarized with a measure of central tendency, such a value 
will probably be close to the analogous value obtained for the placebo group (whose scores 
cluster in the middle of the series). Nevertheless, the pattern of the data certainly suggests that 
a difference with respect to the variability of scores exists between the groups. In other words, 
half of the people receiving the drug respond to it favorably, while the other half respond to it 
poorly. Most of the people in the placebo group, on the other hand, obtain scores in the middle 
of the distribution. The above example illustrates the fact that in certain situations the single- 
sample runs test may provide more useful information regarding two independent samples than 
other tests which are more commonly used for such a design, specifically the t test for two
independent samples (Test 11) and the Mann-Whitney U test (Test 12), both of which 
evaluate measures of central tendency. The pattern of data depicted for the series under dis- 
cussion is more likely to be identified by a test that contrasts the variability of two independent 
samples. In addition to the single-sample runs test, other procedures (discussed later in the 
book) that are better suited to identify differences with respect to group variability are the Siegel- 
Tukey test of equal variability (Test 14) and the Moses test for equal variability (Test 15). 


Example 10.7 A quality control engineer is asked by the manager of a factory to evaluate a 
machine that packages glassware. The manager informs the engineer that 90% of the glassware
processed by the machine remains intact, while the remaining 10% of the glassware is cracked
during the packaging process. It is suspected that some cyclical environmental condition may 
be causing the machine to produce breakages at certain points in time. In order to assess the 
situation, the quality control engineer records a series comprised of 1000 pieces of glassware 
packaged by the machine over a two-week period. It is determined that within the series 890 
pieces of glassware remain intact and that 110 are cracked. It is also determined that within the 
series there are only 4 runs. Do the data indicate the series is nonrandom? 


In Example 10.7, N = 1000, n_1 = 890, n_2 = 110, and r = 4. Employing these values
in Equation 10.1, the value z = -31.20 is computed. In employing the latter equation, the
computed value for the expected number of runs is μ_r = 196.8, which is well in excess of the
observed value r = 4.

\[ z = \frac{4 - \left[\frac{(2)(890)(110)}{890 + 110} + 1\right]}{\sqrt{\frac{(2)(890)(110)[(2)(890)(110) - 890 - 110]}{(890 + 110)^2 (890 + 110 - 1)}}} = \frac{4 - 196.8}{6.18} = -31.20 \]

Employing Table A8, we determine that the absolute value z = 31.20 is greater than the
tabled critical two-tailed values z_.05 = 1.96 and z_.01 = 2.58, and the tabled critical one-tailed
values z_.05 = 1.65 and z_.01 = 2.33. Thus, the nondirectional alternative hypothesis is supported
at both the .05 and .01 levels. Since the obtained value of z is negative, the directional alternative
hypothesis predicting too few runs is supported at both the .05 and .01 levels. Obviously, the
directional alternative hypothesis predicting too many runs is not supported.
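
The arithmetic for Example 10.7 can be checked with the short sketch below. It assumes that Equation 10.1 is the usual normal approximation for the single-sample runs test, in which the expected number of runs is 2n_1n_2/(n_1 + n_2) + 1; the function name is an illustrative assumption.

```python
import math


def single_sample_runs_z(n1, n2, runs):
    """Normal approximation for the single-sample runs test (assumed form).

    Returns the expected number of runs, its standard deviation, and z.
    """
    total = n1 + n2
    mu = 2 * n1 * n2 / total + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / (total ** 2 * (total - 1))
    sigma = math.sqrt(variance)
    return mu, sigma, (runs - mu) / sigma


mu, sigma, z = single_sample_runs_z(n1=890, n2=110, runs=4)
print(round(mu, 1))  # 196.8
print(z < -2.58)     # True: z lies below the two-tailed .01 critical value
```

Running the sketch gives an expected number of runs of 196.8 against an observed r = 4, so the computed z falls far below the lower critical values, in agreement with the conclusion that the series contains too few runs.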

Note that in Example 10.7 the plant manager informs the quality control engineer that
the likelihood of a piece of glassware remaining intact is π_1 = .9, whereas the likelihood of it being
cracked is π_2 = .1. Thus, the two alternatives (Intact versus Cracked) are not equally
likely to occur on a given trial. The observed proportions of cases in the two categories,
p_1 = 890/1000 = .89 and p_2 = 110/1000 = .11, are quite close to the values π_1 = .9 and
π_2 = .1. As noted earlier in the discussion of the single-sample runs test, if the observed
proportions are substantially different from the values assumed for the population proportions, 
the test will not identify such a difference, and the analysis of runs will be based on the observed 
values of the proportions in the series, regardless of whether or not they are consistent with the 
underlying population proportions. The question of whether the proportions computed for the 
sample data are consistent with the population proportions is certainly relevant to the issue of 
whether or not the series is random. However, the binomial sign test for a single sample (Test 
9) and the chi-square goodness-of-fit test (Test 8) are the appropriate tests to employ to 
evaluate the latter question. 





IX. Addendum 


1. The generation of pseudorandom numbers In Section VII the distinction between 
random and pseudorandom numbers was discussed. In this section a number of pseudorandom 
number generators will be described. Various sources (e.g., Banks and Carson (1984), Phillips 
et al. (1976), and Schmidt and Taylor (1970)) note that an effective random number generator is 
characterized by the following properties: a) The numbers generated should conform as closely 
as possible to a uniform distribution. By the latter it is meant that each of the possible values in
the distribution must have an equal likelihood of occurring. Thus, if each of the random numbers 
is an integer value between 0 and 9, each of the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 must have an 
equal likelihood of occurring; b) The random number generator should have a long period. The 
period is the number of random numbers that are generated before the sequence begins to repeat. 
The term cycling is used when a sequence of numbers begins to repeat itself; c) A good generator 
should not degenerate. Degeneracy is when a random number generator at some point con- 
tinually produces the same number; d) Since a researcher may wish to repeat the same experiment 
with the same set of numbers, a good random number generator should allow one to reproduce 
the same sequence of numbers, as well as having the ability to produce a unique sequence of 
numbers each time it is run; and e) The structure of a random number generator should be such 
that it can be executed quickly by a computer, and not utilize an excessive amount of computer 
memory. Most random number generators begin with initial values called a seed, constants, 
and/or a modulus. Certain characteristics of these values (such as their magnitude or whether
they are a prime number) may be critical in determining the quality of the random series that will
be produced by a generator. At this point a number of different types of random number
generators will be described. Keep in mind that since quality random number generators involve the
use of very large numbers and employ a very large number of iterations (an iteration is the
repetitive use of the same equation or set of mathematical operations), they require the use of a
computer.


The midsquare method The first method for generating pseudorandom numbers was the 
midsquare method developed by John von Neumann in 1946 for simulation purposes. The 
midsquare method begins with a seed number, which is an initial value that the user arbitrarily 
selects. The first step of the midsquare method requires that the seed is squared, and the first 
random number will be the middle r digits of the resulting number. Each subsequent number is 
obtained by squaring the previous number, and once again employing the middle r digits of the 
resulting number as the next random number. The process is continued until the sequence of 
numbers cycles or degenerates. 

To illustrate the midsquare method, assume that the seed number is 4931. The square of 
4931 is 24314761. The middle four digits 3147 will represent the first random number. When
we square 3147, we obtain 9903609. The middle four digits of the latter number 0360 will 
represent the next random number. When we square 0360, we obtain 129600. The middle four 
digits of the latter number 2960 will represent the next random number. As noted above, the 
process is continued until the sequence of numbers cycles or degenerates. Since the midsquare 
method tends to degenerate rapidly (resulting in a short period), it is seldom used today. 
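
The illustration above can be reproduced with the following Python sketch. The function names are illustrative assumptions. The rule for locating the "middle" digits when the square has an odd number of digits is inferred from the worked example (one more digit is dropped on the left than on the right, with no zero padding); other descriptions of the midsquare method pad the square with leading zeros, which would yield a different sequence.

```python
def middle_digits(value, r=4):
    """Return the middle r digits of value, following the worked example.

    When the number of surplus digits is odd, the extra digit is dropped
    on the left. No zero padding is applied (an inferred convention).
    """
    digits = str(value)
    surplus = len(digits) - r
    if surplus <= 0:
        return value
    left = (surplus + 1) // 2
    return int(digits[left:left + r])


def midsquare(seed, count, r=4):
    """Generate `count` pseudorandom numbers by the midsquare method."""
    numbers = []
    x = seed
    for _ in range(count):
        x = middle_digits(x * x, r)
        numbers.append(x)
    return numbers


print(midsquare(4931, 3))  # [3147, 360, 2960]
```

The second value is printed as 360 because the leading zero of 0360 is not retained in an integer.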


The midproduct method Although the midproduct method is superior to the midsquare 
method, it has a relatively short period compared to the best of the random number generators 
that are employed today. In the midproduct method one starts with two seed numbers (to be
designated m_1 and m_2), each number containing the same number of digits. The values m_1 and m_2
are multiplied, and the middle r digits of the resulting value, designated m_3, are used to represent
the first random number. m_2 is now multiplied by m_3, and the middle r digits of the resulting
number are designated as the second random number. The process is continued until the
sequence of numbers cycles or degenerates.

To illustrate the midproduct method, assume we begin with the two seed numbers 4931 
and 7737, which when multiplied yield 38151147. The middle four digits of the latter 
number are 1511, which will represent the first random number. We next multiply the second 
seed (7737) by 1511, which yields 11690607. The middle four digits of the latter number are
6906, which will represent the next random number. We next multiply 1511 by 6906 yielding 
10434966. The middle four digits of the latter number are 4349, which represent the next 
random number. As noted above, the process is continued until the sequence of numbers cycles 
or degenerates. 
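
A minimal Python sketch of the two-seed midproduct method follows. The function names are illustrative assumptions, and the helper that extracts the middle digits assumes that any surplus odd digit is dropped on the left; all products in this illustration have eight digits, so that assumption does not affect the result.

```python
def middle_digits(value, r=4):
    """Return the middle r digits of value (extra odd digit dropped on the left)."""
    digits = str(value)
    surplus = len(digits) - r
    if surplus <= 0:
        return value
    left = (surplus + 1) // 2
    return int(digits[left:left + r])


def midproduct(seed_1, seed_2, count, r=4):
    """Generate `count` numbers; each is the middle of the product of the prior two."""
    numbers = []
    previous, current = seed_1, seed_2
    for _ in range(count):
        # The right-hand side is evaluated with the old pair of values.
        previous, current = current, middle_digits(previous * current, r)
        numbers.append(current)
    return numbers


print(midproduct(4931, 7737, 3))  # [1511, 6906, 4349]
```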

A variant of the midproduct method employs just one seed and another value called a 
constant multiplier. The seed is multiplied by the constant multiplier, and the first random 
number is the middle r digits of the resulting product. That value is then multiplied by the 
constant multiplier, and the second random number is the middle r digits of the resulting product, 
and so on. The process is continued until the sequence of numbers cycles or degenerates. 

To illustrate this variant of the midproduct method, assume we begin with the seed 4931 
and the constant multiplier 7737. When we multiply these two values we obtain 38151147. The 
middle four digits of the latter number are 1511, which will represent the first random number. 
We next multiply 1511 by the constant multiplier 7737 obtaining 11690607. The middle four 
digits of the latter number are 6906, which will represent the next random number. The value 
6906 is multiplied by the constant multiplier 7737 yielding 53431722. The middle four digits of
the latter number are 4317, which will represent the next random number. As noted above, the
process is continued until the sequence of numbers cycles or degenerates. 
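
The constant multiplier variant differs from the two-seed version only in that the multiplier never changes. A minimal sketch, with illustrative function names and the same assumed rule for extracting the middle digits, is given below.

```python
def middle_digits(value, r=4):
    """Return the middle r digits of value (extra odd digit dropped on the left)."""
    digits = str(value)
    surplus = len(digits) - r
    if surplus <= 0:
        return value
    left = (surplus + 1) // 2
    return int(digits[left:left + r])


def constant_multiplier(seed, multiplier, count, r=4):
    """Generate `count` numbers by repeatedly multiplying by a fixed constant."""
    numbers = []
    x = seed
    for _ in range(count):
        x = middle_digits(x * multiplier, r)
        numbers.append(x)
    return numbers


print(constant_multiplier(4931, 7737, 3))  # [1511, 6906, 4317]
```

The first two values coincide with those of the two-seed illustration because the same pair of starting numbers is used; the sequences diverge at the third value.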

The linear congruential method  Congruential methods are the most commonly used
mechanisms employed today for generating random numbers. Congruential random number 
generators employ modular arithmetic, which means that they employ as a random number the 
remainder that results after dividing one number by another. To describe the use of modular 
arithmetic within the context of the congruential method, we will let the notation mod represent 
modulus. The notation y mod m means that some number designated by the symbol y is divided 
by the modulus which is represented by the value m. Whatever remainder results from this 
division will be employed as a random number. 

The linear congruential method, which was developed by the mathematician Derrick 
Lehmer, is probably the most commonly used of the random number generators that are based 
on the congruential method. The linear congruential method produces random numbers that fall 
in the range 0 to m - 1. It is based on the following recursive relationship (a recursive
relationship is one in which each result is computed by employing the information from the
previous result): x_{i+1} = (a x_i + c) mod m. In the aforementioned equation, a is a constant
multiplier, c is referred to as the additive constant or increment, and m is the modulus. All
of the aforementioned values remain unchanged each time the equation is employed. The initial
value of x_i (which we will refer to as x_0) will be the seed. When the equation is employed the
first time, the seed (x_0) is multiplied by the constant multiplier (a) and the additive constant (c)
is added to the product. The latter value is divided by the modulus (m), and the remainder after
the division represents the value on the left side of the equation (x_{i+1}). This latter value will
represent the first random number (x_1). The equation is then employed again, using the value
x_1 in the right side of the equation to represent x_i. The resulting value will represent the second
random number. This process is continued until the sequence of numbers cycles or degenerates.

To illustrate the linear congruential method, assume we begin with the following values: 
Seed = $x_0$ = 47; Constant multiplier = a = 17; Additive constant = c = 79; Modulus 
= m = 100. Our initial equation is thus $x_1 = [(17)(47) + 79] \bmod 100$. The computed value is 
$x_1 = 78$, since (17)(47) + 79 = 878, which when divided by 100 yields 8 with a remainder of 78. 
Thus 78 will represent the first random number. The equation is employed again using the value 
$x_1 = 78$ to represent $x_i$ on the right side of the equation. Thus, $x_2 = [(17)(78) + 79] \bmod 100$, 
which yields 1405 divided by the modulus of 100. When 1405 is divided by 100, we obtain 
14 with a remainder of 5. Thus, 5 is our second random number. The equation is then employed 
again using the value $x_2 = 5$ to represent $x_i$ on the right side of the equation. Thus, 
$x_3 = [(17)(5) + 79] \bmod 100$, which yields 164 divided by the modulus of 100. When 164 is 
divided by 100, we obtain 1 with a remainder of 64. Thus, 64 is our third random number. As 
noted above, this process is continued until the sequence of numbers cycles or degenerates. It 
should be noted that in order to generate a sequence of random numbers that are of high quality 
with a congruential generator, the value of the modulus must be quite large. Bennett (1998) 
notes that Lehmer suggested the value 2,147,483,647 (which is equivalent to $2^{31} - 1$) for the 
modulus of congruential generators, and the latter value is, in fact, employed in many linear 
congruential generators. The values of the constant multiplier and additive constant vary from 
generator to generator. Bang et al. (1998) note that when the additive constant c = 0, the term 
multiplicative congruential generator is employed in reference to a linear congruential 
generator. In point of fact, most of the commonly used congruential random number generators 
are multiplicative congruential (i.e., since c = 0, the congruential equation becomes 
$x_{i+1} = ax_i \bmod m$). 
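
The recursion described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a generator suitable for research use; the function name `lcg` and its parameter names are assumptions introduced here, and the constants are the small demonstration values from the worked example rather than values anyone would employ in practice.

```python
def lcg(seed, a, c, m, count):
    """Return the first `count` values of x_{i+1} = (a * x_i + c) mod m."""
    values = []
    x = seed
    for _ in range(count):
        x = (a * x + c) % m  # remainder after dividing by the modulus
        values.append(x)
    return values


# Demonstration values from the worked example: seed 47, a = 17, c = 79, m = 100.
first_three = lcg(seed=47, a=17, c=79, m=100, count=3)
print(first_three)  # [78, 5, 64]
```

Setting `c=0` in the same function gives the multiplicative congruential form.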

Note that in all of the random number generators that have been described in this section, 
an integer number has been employed to represent each random number. In the case of the 
midsquare and midproduct methods, each value generated was a four-digit number that fell in the 
range 0000 to 9999. If we had wanted to, we could have selected the middle two or three digits 
or just one digit (instead of the middle four digits) from the product that was derived from 
multiplication. Also, by increasing the size of the seed(s) and/or the constant multiplier, a larger 
product can be obtained allowing one to select the middle six, seven, eight, etc. digits as a random 
number. Often when random generators are employed, the range of random numbers one desires 
may be very limited. Let us assume that we generate four-digit numbers as illustrated above, but 
want our sequence of random numbers to be comprised of two-digit numbers. The latter can be 
easily achieved by breaking each four-digit number into two numbers comprised of two digits, 
each number falling in the range 00 to 99. If one-digit numbers are desired, four one-digit 
numbers can be extracted from the four-digit number. If one is interested in a binary series of 
numbers, in which the only values employed are 0 and 1, each odd digit can be employed to 
represent one alternative and each even digit the other alternative. If one only wants three values, 
the digits 0, 1, 2 can be employed to represent one value, 3, 4, 5, a second value, and 6, 7, 8 a 
third value (the digit 9 would just be ignored when it occurs). By using the aforementioned 
logic, the number(s) generated can be formatted to represent however many alternatives one 
wants to employ within a sequence of random numbers. 
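
The reformatting schemes just described can be sketched as follows. The function names are hypothetical, and the choice of which alternative is labeled 0 or 1 (or 0, 1, 2) is an arbitrary assumption made for illustration; the text only requires that the digits be partitioned as stated.

```python
def split_pairs(four_digit):
    """Split a four-digit value (0000 to 9999) into two two-digit values."""
    return divmod(four_digit, 100)  # (leading two digits, trailing two digits)


def to_digits(four_digit):
    """Return the four decimal digits, keeping any leading zeros."""
    return [int(ch) for ch in f"{four_digit:04d}"]


def to_binary(digit):
    """Binary series: odd digits map to 1, even digits map to 0."""
    return digit % 2


def to_three_way(digit):
    """Three alternatives: 0-2 -> 0, 3-5 -> 1, 6-8 -> 2; the digit 9 is ignored (None)."""
    return None if digit == 9 else digit // 3


print(split_pairs(4317))                           # (43, 17)
print(to_digits(4317))                             # [4, 3, 1, 7]
print([to_binary(d) for d in to_digits(4317)])     # [0, 1, 1, 1]
print([to_three_way(d) for d in to_digits(4317)])  # [1, 1, 0, 2]
```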

In actuality, most random number generators return a value that falls within the range 0 to 
1. If one wished to convert any of the random numbers derived in this section into a value that 
falls within that range, each value can be divided by 10 raised to the appropriate power. In other 
words, if we take the random number 1511 generated earlier and divide it by 10000, we obtain 
.1511. If we break 1511 into the two numbers 15 and 11, the latter values can be converted into 
.15 and .11 by dividing them by 100. Congruential generators typically return decimal values 
to represent random numbers. Thus, instead of expressing the remainder as an integer (as was 
done above), a decimal format is employed. For example, consider the value 78 used to represent 
the first random number generated by a linear congruential generator. The latter value was the 
remainder when 878 was divided by 100. The usual way of expressing the result of the division 
878/100 is 8.78. The decimal part of the result, .78, can be employed to represent a random 
number in the range 0 to 1. 

If one happens to be employing a random number generator that yields values between 0 
and 1, it is easy to convert the latter values into integer numbers that fall within a specified range 
of values. As an example, let us assume a computer generates a series of random numbers that 
are in the range between 0 and 1, and that we wish to convert each number into an integer value 
between 1 and 6 in order to simulate the throwing of a die. If we multiply each of the numbers 
that fall in the range 0 to 1 by 6 (i.e., the value that represents the largest integer value) and 
employ the integer part of the result (i.e., the number to the left of the decimal) with one unit 
added, we will create a random series of integer numbers that fall in the range 1 to 6. (The only 
exception to this will be if the random number generated equals 1, which when multiplied by 6 
with 1 added will yield the value 7.) To illustrate, if the first random number generated by the 
computer is .9888 and we multiply the latter number by 6 we obtain 5.9328. Since the value 5 
represents the integer number to the left of the decimal, we add one unit to it making it the 
number 6. The latter value will represent the first outcome for the die. If the next random 
number the computer generates is .0321, the latter value multiplied by 6 equals .1926. Since the 
value to the left of the decimal is a zero, we add 1 to it, and the resulting value of 1 will represent 
the next outcome for the die. 
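
A minimal sketch of this conversion is shown below. The function name `to_die` and its `faces` parameter are assumptions introduced for illustration; the sketch rejects an input of exactly 1, since that is the exceptional case noted above that would yield the value 7.

```python
import math


def to_die(u, faces=6):
    """Convert u (0 <= u < 1) to an integer in 1..faces: integer part of u * faces, plus one."""
    if not 0.0 <= u < 1.0:
        # u == 1 would produce faces + 1, the exception noted in the text
        raise ValueError("u must satisfy 0 <= u < 1")
    return math.floor(u * faces) + 1


print(to_die(0.9888))  # 6
print(to_die(0.0321))  # 1
```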

Bang et al. (1998) note that at the current time there are three types of random number 
generators that are commonly employed. The first kind are congruential random number gen- 
erators which were discussed earlier. The other two types of random number generators that 
are frequently used are: a) The shift-register generator (also known as the Tausworthe generator), 
which employs the binary structure of computers; and b) The Fibonacci generator (also 
known as the additive generator). Representative of the latter type of generator is the 
Marsaglia-Zaman method, which employs the Fibonacci sequence. The latter is a sequence of 
numbers in which, after the first two values, every number is the sum of the previous two 
numbers. Thus: 1, 1, 2, 3, 5, 8, 13, 21, etc. Peterson (1998) describes how the Fibonacci 
sequence was utilized by Marsaglia and Zaman to generate a random number series with an 
exceptionally long period. Bang et al. (1998), Banks and Carson (1984), Bennett (1998), 
Gruenberger and Jaffray (1965), James (1990), Knuth (1969, 1981, 1997), Peterson (1998), 
Phillips et al. (1976), and Schmidt and Taylor (1970) are sources which can provide the reader 
with a more detailed description and/or critique of random number generators. 


2. Alternative tests of randomness  The single-sample runs test is one of many tests that 
have been developed for assessing randomness. Most of the alternative procedures for assessing 
randomness allow one to evaluate series in which, on each trial, there are two or more possible 
outcomes. A general problem with tests of randomness is that they do not employ the same 
criteria for assessing randomness. As a result of this some tests are more stringent than others, 
and thus it is not uncommon that a series of numbers may meet the requirements of one or more 
of the available tests of randomness, yet not meet the requirements of one or more of the other 
tests. This section will discuss some of the more commonly employed tests for evaluating 
random number sequences. 


Test 10b: The frequency test The frequency test (also known as the equidistribution test), 
which is probably the least demanding of the tests of randomness, assesses randomness on the 
basis of whether k = 2 or more equally probable alternatives occur an equal number of times within 
a series. The data for the frequency test are evaluated with the chi-square goodness-of-fit test, 
and when k = 2 the binomial sign test for a single sample (as well as the large sample normal 
approximation) can be employed. The Kolmogorov-Smirnov goodness-of-fit test for a single 
sample (Test 7) can also be employed to assess the uniformity of the scores in a distribution. 
The interested reader should consult Banks and Carson (1984) and Schmidt and Taylor (1970) 
for a description of how the latter test is employed within this context. 

Since the frequency test only assesses a series with respect to the frequency of occurrence 
of each of the outcomes, it is insensitive to systematic patterns that may exist within a series. To 
illustrate this limitation of the frequency test, consider the following two binary series consisting 
of Heads (H) and Tails (T), where $\pi_1 = \pi_2 = .5$. 

Series A: HHHTHTTTHTHHTHTTTHTH 
Series B: HTHTHTHTHTHTHTHTHTHT 


Inspection of the data indicates that both Series A and B are comprised of 10 Heads and 
10 Tails. Since the number of times each of the alternatives occurs is at the chance level (i.e., 
each occurs in 50% of the trials), if one elects to analyze either series employing either the chi- 
square goodness-of-fit test or the binomial sign test for a single sample, both series will meet 
the criterion for being random. However, visual inspection of the two series clearly suggests 
that, as opposed to Series A, Series B is characterized by a systematic pattern involving the 
alternation of Heads and Tails. This latter observation clearly suggests that Series B is not 
random. 

At this point, the frequency test and the single-sample runs test will be applied to the 
same set of data. Specifically, both tests will be employed to evaluate whether or not the series 
below (which consists of N = 30 trials) is random. In the series, the runs are separated by 
spaces. 


HHHHH T H T HHH T H T H T HHH TTTT HHHHHHH 

Since k = 2, the binomial sign test for a single sample will be employed to represent the 
frequency test. When the binomial sign test is employed, the null hypothesis that is evaluated 
is $H_0$: $\pi_1 = .5$ (since it is assumed that $\pi_1 = \pi_2 = .5$). For both the binomial sign test for 
a single sample and the single-sample runs test, it will be assumed that a nondirectional 
alternative hypothesis is evaluated. 

In the series that is being evaluated, the number of Heads is $n_1 = 21$ and the number of 
Tails is $n_2 = 9$. Employing Equation 9.9 (which is the continuity-corrected normal approximation 
for the binomial sign test for a single sample), the value z = 2.01 is computed. 

$$z = \frac{|x - n\pi_1| - .5}{\sqrt{n\pi_1\pi_2}} = \frac{|21 - (30)(.5)| - .5}{\sqrt{(30)(.5)(.5)}} = 2.01$$


Since the obtained value z = 2.01 is greater than the tabled critical two-tailed value 
$z_{.05} = 1.96$, the nondirectional alternative hypothesis $H_1$: $\pi_1 \neq .5$ is supported. Thus, based 
on the above analysis with the binomial sign test for a single sample, one can conclude that the 
series is not random. 
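
The computation with the continuity-corrected normal approximation can be checked with a short sketch. The function name `sign_test_z` is an assumption introduced here; the function simply evaluates the expression (|x - nπ₁| - .5)/√(nπ₁π₂) for the 30-trial series above and does not consult a table of critical values.

```python
import math


def sign_test_z(x, n, pi1=0.5):
    """Continuity-corrected normal approximation for the binomial sign test."""
    return (abs(x - n * pi1) - 0.5) / math.sqrt(n * pi1 * (1 - pi1))


# The 30-trial series from the text, with the spaces between runs removed.
series = "HHHHH T H T HHH T H T H T HHH TTTT HHHHHHH".replace(" ", "")
heads = series.count("H")
z = sign_test_z(heads, len(series))
print(heads, round(z, 2))  # 21 2.01
```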

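In evaluating the same series with the single-sample runs test, we determine that there are 
r = 13 runs in the data. The expected number of runs is [(2)(21)(9)]/(21 + 9) + 1 = 13.6, which 
is barely above the observed value r = 13. Employing Equation 10.1, the value z = -.27 is 
computed. 

$$z = \frac{r - \left[\frac{2n_1n_2}{n_1 + n_2} + 1\right]}{\sqrt{\frac{2n_1n_2(2n_1n_2 - n_1 - n_2)}{(n_1 + n_2)^2(n_1 + n_2 - 1)}}} = \frac{13 - \left[\frac{(2)(21)(9)}{21 + 9} + 1\right]}{\sqrt{\frac{(2)(21)(9)[(2)(21)(9) - 21 - 9]}{(21 + 9)^2(21 + 9 - 1)}}} = -.27$$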


Since the obtained absolute value z = .27 is less than the tabled critical two-tailed value 
$z_{.05} = 1.96$, the nondirectional alternative hypothesis for the runs test is not supported. Thus, 
based on the above analysis with the single-sample runs test, one can conclude that the series 
is random. 
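
The runs-test computation can be sketched in the same way. The function name `runs_test_z` is hypothetical, and the sketch assumes a series with exactly two categories; it counts runs with `itertools.groupby` and then applies the expected value and standard deviation of the number of runs employed in the normal approximation.

```python
import math
from itertools import groupby


def runs_test_z(series):
    """Return (runs, expected runs, z) for a two-category series."""
    labels = sorted(set(series))
    if len(labels) != 2:
        raise ValueError("series must contain exactly two categories")
    r = sum(1 for _ in groupby(series))  # each group of identical neighbors is one run
    n1 = series.count(labels[0])
    n2 = series.count(labels[1])
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    return r, expected, (r - expected) / math.sqrt(variance)


series = "HHHHH T H T HHH T H T H T HHH TTTT HHHHHHH".replace(" ", "")
r, expected, z = runs_test_z(series)
print(r, round(expected, 1), round(z, 2))  # 13 13.6 -0.27
```

The same function can be applied to any other two-category series, for example one in which all outcomes of one type precede all outcomes of the other.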

The fact that a significant result is obtained when the binomial sign test for a single 
sample is employed to evaluate the series, reflects the fact that the latter test only takes into 
account the number of observations in each of the two categories, but does not take into 
consideration the ordering of the data. The single-sample runs test, on the other hand, is 
sensitive to the ordering of the data, yet will not always identify a nonrandom series if 
nonrandomness is a function of the number of outcomes for each of the alternatives. 

There will also be instances where one may conclude a series is random based on an 
analysis of the data with the binomial sign test for a single sample, yet not conclude the series is 
random if the single-sample runs test is employed for the analysis. Such a series, consisting 
of 15 Heads and 15 Tails, is depicted below. 


HHHHHHHHHHHHHHH TTTTTTTTTTTTTTT 


When Equation 9.7 (the normal approximation of the binomial sign test for a single 
sample) is employed, it yields the following result: $z = [15 - (30)(.5)]/\sqrt{(30)(.5)(.5)} = 0$. 
Equation 9.9, the continuity-corrected equation, yields the absolute value z = .18. Since the latter 
values are less than the tabled critical two-tailed value $z_{.05} = 1.96$, the result is not significant, 
and thus one can conclude that the series is random. When, however, the same series, which 
consists of only two runs, is evaluated with Equation 10.1 (the equation for the normal approximation 
of the single-sample runs test), it yields the following result: $z = (r - \mu_r)/\sigma_r$ 
= (2 - 16)/2.69 = -5.20. Since the absolute value z = 5.20 is greater than the tabled critical two-tailed 
values $z_{.05} = 1.96$ and $z_{.01} = 2.58$, the result is significant at both the .05 and .01 levels. 
Thus, if one employs the single-sample runs test, one can conclude that the series is not random. 

One final set of data will be evaluated with the frequency test. The data to be presented 
(which were generated by a computer program) will also be evaluated with three other tests of 
randomness which will be presented in this section — specifically, the gap test, the poker test, 
and the maximum test. Keep in mind that in computer simulation research, the number of digits 
that comprise a series of random numbers will typically be more than the 120 digits used in the data 
set to be presented. When random number generators are evaluated, millions or even billions of 
digits are generated and analyzed. One should keep in mind that within a random series of 
millions of numbers, there will undoubtedly be sequences of shorter duration that in and of them- 
selves would not pass a test for randomness. In any event, the series presented below consists 
of 120 digits in the range 0-9. 


8, 9, 3, 7, 2, 3, 0, 2, 3, 1, 4, 7, 8, 5, 6, 2, 0, 9, 6, 8, 7, 5, 3, 0, 7, 8, 9, 6, 3, 5, 
9, 9, 8, 4, 6, 3, 7, 9, 1, 0, 8, 3, 7, 6, 1, 0, 0, 3, 8, 9, 5, 6, 6, 7, 4, 1, 2, 0, 3, 6, 
7, 8, 8, 8, 9, 9, 4, 5, 3, 3, 1, 1, 1, 6, 0, 0, 8, 7, 7, 3, 9, 7, 5, 2, 0, 3, 8, 6, 0, 4, 
6, 3, 0, 2, 8, 6, 7, 0, 0, 1, 2, 5, 0, 5, 7, 9, 0, 8, 6, 4, 3, 2, 5, 8, 9, 6, 1, 0, 7, 8 


The data are evaluated with the chi-square goodness-of-fit test. There are 10 categories, 
one corresponding to each of the 10 digits. Each digit followed by its observed frequency in the 
120 digit series is presented: 0 (17); 1 (9); 2 (8); 3 (15); 4 (6); 5 (9); 6 (14); 7 (14); 8 (16); 9 (12). 
If (as it is employed in Equation 8.1) n represents the total number of observations, the expected 
frequency for each digit is $E_i = n\pi_i = (120)(.1) = 12$. Since the chi-square analysis will be 
based on k = 10 categories/cells, the degrees of freedom will be df = k - 1 = 10 - 1 = 9. When 
the data are evaluated with Equation 8.2, the value $\chi^2 = 10.65$ is computed. Table 10.1 
provides a summary of the analysis. 
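
The frequency counts and the chi-square value can be reproduced with the following sketch. The variable names are assumptions introduced here, and the series is written in groups of five digits purely for readability. Note that the unrounded sum is approximately 10.67; the value 10.65 reported in the text results from summing cell values that have each been rounded to two decimal places.

```python
from collections import Counter

# The 120-digit series from the text, one 30-digit row per string.
ROWS = [
    "89372 30231 47856 20968 75307 89635",
    "99846 37910 83761 00389 56674 12036",
    "78889 94533 11160 08773 97520 38604",
    "63028 67001 25057 90864 32589 61078",
]
digits = [int(ch) for row in ROWS for ch in row.replace(" ", "")]

observed = Counter(digits)
expected = len(digits) / 10  # (120)(.1) = 12 for each digit
chi_square = sum((observed[d] - expected) ** 2 / expected for d in range(10))

print([observed[d] for d in range(10)])  # [17, 9, 8, 15, 6, 9, 14, 14, 16, 12]
print(round(chi_square, 2))              # 10.67
```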


Table 10.1 Summary of Chi-Square Analysis for Frequency Test 


              Observed        Expected 
Cell/Digit    Frequency (O)   Frequency (E)   (O - E)^2 / E 

    0             17              12              2.08 
    1              9              12               .75 
    2              8              12              1.33 
    3             15              12               .75 
    4              6              12              3.00 
    5              9              12               .75 
    6             14              12               .33 
    7             14              12               .33 
    8             16              12              1.33 
    9             12              12               .00 
  Sums           120             120          χ² = 10.65 


Employing Table A4 (Table of the Chi-Square Distribution) in the Appendix, for 
df = 9, the tabled critical values are $\chi^2_{.05} = 16.92$ and $\chi^2_{.01} = 21.67$. Since the obtained value $\chi^2 = 10.65$ 
is less than $\chi^2_{.05} = 16.92$, the null hypothesis is retained. Thus, the data are consistent with the 
series being random. 


Test 10c: The gap test Described in Banks and Carson (1984), Gruenberger and Jaffray 
(1965), Knuth (1969, 1981, 1997), Phillips et al. (1976), and Schmidt and Taylor (1970), the gap 
test evaluates the number of gaps between the appearance of a digit in a series and the 
reappearance of the same digit. Thus, if we have k = 10 digits and each of the digits is equally 
likely to occur, it would be expected that if the distribution of digits in a series consisting of n 
digits is random, the average gap/interval for the reoccurrence of each digit will equal k = 10. 
A gap for any digit can be determined by selecting that digit and counting until the next 
appearance of the same digit. To illustrate the concept of a gap, consider the following series of 
digits: 0121046720. For the digit 0 we can count two gaps of lengths 3 and 4 respectively. This 
is the case, since the number of digits between the first 0 and the second 0 is 3, and the number 
of digits between the second 0 and the third 0 is 4. In conducting the gap test, all of the gaps for 
each of the digits in the series are counted, after which the computed gap values are evaluated. 
The analysis of the data for a series within the framework of the gap test can employ one or more 
statistical tests that have been discussed in this book: a) The single-sample z test (Test 1) can 
be employed to contrast the computed mean gap value versus the expected mean gap value for 
each digit; b) The single-sample chi-square test for a population variance (Test 3) can be 
employed to contrast the observed versus expected variance of the gap values for each digit; and 
c) The chi-square goodness-of-fit test can be employed to compare the observed versus 
expected gap lengths of a specific value for each or all of the digits separately or together. 

To illustrate some of the analyses that can be conducted on gap values, we will employ the 
same 120 digit series evaluated earlier with the frequency test. The number of gaps for any digit 
in a series will be one less than the frequency of occurrence of that digit in the series. Consequently, 
the total number of gaps in the series will be n - k (i.e., the total number of digits 
generated less the number of digit categories employed). Thus, in the case of a 120 digit series 
employing 10 digit values, the total number of gaps in the series will equal 120 - 10 = 110. It 
was noted in the frequency test analysis of the 120-digit series, that the digit 0 appears 17 times. 
Thus, there are 16 gaps for the value 0. Inspection of the series will reveal that the first gap 
length is 9, the second 6, and so on. The 16 gap lengths for the digit 0 are as follows: 4 gaps of 
length 3, 3 gaps of length 0, 2 gaps of length 10, and 1 gap of lengths 4, 5, 6, 8, 9, 15, and 16. 
The computed mean (through use of Equation I.1) and estimated population variance (through 
use of Equation I.5) of the 16 gap lengths are $\bar{X} = 5.94$ and $\tilde{s}^2 = 25.00$. The expected value 
for the mean ($\mu$) is equal to k, the number of digit categories. Thus, $\mu = k = 10$. The expected 
value for the variance of the gaps for a digit is $\sigma^2 = k(k - 1)$. Thus, for each of the ten digits 
the expected variance is $\sigma^2 = 10(10 - 1) = 90$. 

With respect to the data for the digit 0, the single-sample z test (employing Equation 1.3) 
will be used to evaluate the null hypothesis that for the digit 0, gap values in the sample are 
consistent with a population that has a mean value of 10. The single-sample chi-square test for 
a population variance (employing Equation 3.2) will be used to evaluate the null hypothesis that 
for the digit 0, gap values in the sample are consistent with a population that has a variance of 
90. In a random distribution both of the aforementioned null hypotheses would not be rejected, 
not only in the case of the digit 0, but also in the case of any of the other nine digits. 

When the single-sample z test is employed to evaluate the hypothesis about the mean 
value of the gaps for the digit 0, it yields the value z = -1.69. Note that the value 9.49 employed 
in Equation 1.3 represents the square root of the expected population variance (i.e., 
$\sqrt{\sigma^2} = \sqrt{90} = 9.49$). The value n = 16 employed in computing $\sigma_{\bar{X}}$ with Equation 1.2, represents 
the number of gaps. 




Since the absolute value z = 1.69 is greater than the tabled critical one-tailed value 
$z_{.05} = 1.65$, the directional alternative hypothesis predicting that the sample came from a population 
with a mean gap value less than 10 (since the value of z is negative) is supported at the .05 
level, but not at the .01 level (since z = 1.69 is less than the tabled critical one-tailed value 
$z_{.01} = 2.33$). Since the absolute value z = 1.69 is less than the tabled critical two-tailed value 
$z_{.05} = 1.96$, the nondirectional alternative hypothesis predicting that the sample came from a 
population with mean gap value other than 10 is not supported. However, the fact that the one-tailed 
analysis yields a significant result suggests that the average gap value of $\bar{X} = 5.94$ for the 
digit 0 is inconsistent with what one would expect in a random distribution of digits. 

When the single-sample chi-square test for a population variance is employed to 
evaluate the hypothesis about the variance of the gaps for the digit 0, it yields the value 
$\chi^2 = 4.17$. 

$$\chi^2 = \frac{(n - 1)\tilde{s}^2}{\sigma^2} = \frac{(16 - 1)(25.00)}{90} = 4.17$$


Employing Table A4, for df = n - 1 = 15, the tabled critical one-tailed values in the 
lower tail of the chi-square distribution are $\chi^2_{.05} = 7.26$ and $\chi^2_{.01} = 5.23$, and the tabled critical 
two-tailed values are $\chi^2_{.025} = 6.26$ and $\chi^2_{.005} = 4.60$ (the lower tail is employed since the value 
computed for the estimated population variance is less than the hypothesized population 
variance). Since the obtained value $\chi^2 = 4.17$ is less than all of the aforementioned critical 
values, the null hypothesis is rejected, regardless of which alternative hypothesis is employed. 
The data seem to clearly suggest that the sample came from a population with a variance gap 
value less than 90. Thus, the result of the evaluation of the mean and variance of the digit 0 is 
inconsistent with what would be expected in a random distribution of digits. 
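
The gap lengths for the digit 0, together with the two test statistics, can be reproduced with the sketch below. The helper name `gaps_for` is an assumption introduced here, and the expected mean (10) and variance (90) are taken directly from the text as given. Computed without intermediate rounding, the variance statistic agrees with the reported 4.17, while the z value comes out at approximately -1.71, which is close to, though not identical with, the value -1.69 reported above.

```python
import math
from statistics import mean, variance

ROWS = [
    "89372 30231 47856 20968 75307 89635",
    "99846 37910 83761 00389 56674 12036",
    "78889 94533 11160 08773 97520 38604",
    "63028 67001 25057 90864 32589 61078",
]
digits = [int(ch) for row in ROWS for ch in row.replace(" ", "")]


def gaps_for(digit, seq):
    """Number of intervening digits between successive appearances of `digit`."""
    positions = [i for i, d in enumerate(seq) if d == digit]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


zero_gaps = gaps_for(0, digits)
k = 10
mu, sigma2 = k, k * (k - 1)  # expected mean and variance as stated in the text
n = len(zero_gaps)
x_bar = mean(zero_gaps)      # 5.9375
s2 = variance(zero_gaps)     # unbiased estimate, approximately 25.00

z = (x_bar - mu) / (math.sqrt(sigma2) / math.sqrt(n))
chi_square = (n - 1) * s2 / sigma2

print(zero_gaps)
print(round(x_bar, 2), round(s2, 2), round(z, 2), round(chi_square, 2))
```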

The above described analyses can also be employed to evaluate the means and estimated 
variances of the other nine digits. However, as noted in Section VI of the chi-square goodness- 
of-fit test, when a large number of tests are conducted on a set of data, the likelihood of com- 
mitting at least one Type I error increases dramatically. The general issue of conducting multiple 
tests is discussed in detail in Section VI of the single-factor between-subjects analysis of 
variance (Test 21). Nevertheless, it is likely that tests for multiple digits will yield significant 
results in the case of analyzing the means and/or variances of gap values for specific digits in a 
distribution that is not random. 

A common test for randomness that evaluates gaps contrasts the observed frequencies of 
different gap values with their expected frequencies. In a distribution in which there are k = 10 
digit categories, the probability that a gap of length r will occur (for any digit) is 
$P(r) = (.1)(.9)^r$. Thus, the likelihood of finding for any of the ten digits a gap of length 0 will 
be $(.1)(.9)^0 = .1$. The likelihood of finding a gap of length 1 will be $(.1)(.9)^1 = .09$. The likelihood 
of finding a gap of length 2 will be $(.1)(.9)^2 = .081$, and so on. The chi-square goodness-of-fit 
test, which will be demonstrated below, can be employed to compare the values of the observed 
and expected gap lengths for the whole series. If a significant difference is obtained, it would 
warrant the conclusion that the series is not random. The Kolmogorov-Smirnov goodness-of- 
fit test for a single sample can also be employed to assess the distribution of gap lengths within 
a series. The interested reader should consult Banks and Carson (1984) and Schmidt and Taylor 
(1970) for a description of how the latter test is employed within this context. 

Table 10.2 summarizes the analysis of data of the 120 digit series with the chi-square 
goodness-of-fit test. The first column of the table lists 17 gap-length categories, which correspond 
to the 17 cells/categories in the chi-square table. It should be noted that the grouping of 
gaps into 17 categories is arbitrary. It is unlikely (although not impossible) that an alternative 
grouping format (i.e., a fewer or greater number of categories) will yield a substantially different 
result for the chi-square analysis. Each row in Table 10.2 lists in Column 2 the observed number 
of gaps in the data with lengths that correspond to the values listed in the first column of that 
row. The expected number of gaps for the lengths listed are noted in Column 3, followed in 
parentheses by the probability of obtaining gap values of the designated lengths. The latter 
probabilities were obtained by adding up the probability values for each of the gap lengths 
specified in a given row. Thus, the value .271 in the first row is the result of adding the values 
.1, .09, and .081, which are the values computed above (using the equation $P(r) = (.1)(.9)^r$) 
for the gap lengths 0, 1, and 2. The expected frequency in each row is computed by multiplying 
the expected row probability (P(r)) by 110 (which is the total number of gaps). Thus, in the case 
of the first row, (110)(.271) = 29.81. 
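
The row probabilities and expected frequencies described above can be sketched as follows. The function names are hypothetical. The sketch reproduces the first row exactly; for later rows the unrounded probabilities can differ slightly in the third decimal place from the tabled values (e.g., approximately .1976 for gap lengths 3 to 5).

```python
def gap_probability(r, k=10):
    """P(r) = (1/k) * ((k - 1)/k) ** r: r non-matching digits followed by a match."""
    return (1 / k) * ((k - 1) / k) ** r


def grouped_probability(low, high, k=10):
    """Probability that a gap length falls in the range low..high inclusive."""
    return sum(gap_probability(r, k) for r in range(low, high + 1))


total_gaps = 110
p_first = grouped_probability(0, 2)
print(round(p_first, 3), round(total_gaps * p_first, 2))  # 0.271 29.81
```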


Table 10.2 Summary of Chi-Square Analysis for Gap Test 


              Observed            Expected 
Gap length    number of gaps      number of gaps     (O - E)^2 / E 
                  (O)             (E) & (P) 

   0-2             18              29.81 (.271)          4.68 
   3-5             21              21.67 (.197)           .02 
   6-8             25              15.84 (.144)          5.30 
   9-11            15              11.55 (.105)          1.03 
  12-14            13               8.36 (.076)          2.58 
  15-17             6               6.27 (.057)           .01 
  18-20             5               4.51 (.041)           .05 
  21-23             2               3.30 (.030)           .51 
  24-26             3               2.31 (.021)           .21 
  27-29             1               1.76 (.016)           .33 
  30-32             0               1.21 (.011)          1.21 
  33-35             0                .99 (.009)           .99 
  36-38             0                .66 (.006)           .66 
  39-41             1                .44 (.004)           .71 
  42-44             0                .33 (.003)           .33 
  45-47             0                .33 (.003)           .33 
  48-50             0                .33 (.003)           .33 
  Sums            110              110               χ² = 19.28 


The degrees of freedom for the chi-square table equals 16, since it is one less than the 
number of cells (which equals 17). In order to reject the null hypothesis, and thus conclude that 
the distribution is not random, it is required that the computed value of chi-square is equal to or 
greater than the tabled critical value at the prespecified level of significance. Employing Table 
A4 for df = 16, $\chi^2_{.05} = 26.30$ and $\chi^2_{.01} = 32.00$. Since the obtained value $\chi^2 = 19.28$ is less 
than $\chi^2_{.05} = 26.30$, the null hypothesis is retained. Thus, the data are consistent with the series 
being random. Note, however, that the results of the above gap test are not consistent with the 
gap test analysis conducted previously with the single-sample z test and the single-sample chi-square 
test for a population variance (on the gap values of the mean and variance for the digit 
0). It is, however, consistent with the frequency test on the same series of numbers, which also 
did not find evidence of nonrandomness. 


Test 10d: The poker test Described in Banks and Carson (1984), Gruenberger and Jaffray 
(1965), Knuth (1969, 1981, 1997), Phillips et al. (1976), and Schmidt and Taylor (1970), the 
poker test conceptualizes a series of digits as a set of hands in the game of poker. Starting with 
the first five digits in a series of n digits, the five digits are considered as the initial poker hand. 
In the analysis to be conducted, flushes are not possible, straights will not be employed, but 
five of a kind can occur. The analysis is repeated, employing as the second hand digits 6 through 
10, and then as the third hand digits 11 through 15, and so on. The chi-square goodness-of-fit 
test can be employed to evaluate the results of the test by comparing the observed frequencies 
for each of the possible hands with their theoretical/expected frequencies. Within the framework 
of tests of randomness that have been developed, the poker test is among the most stringent. 
Series of digits that are able to meet the criteria of other tests of randomness will often fail the 
poker test. 

The same 120 digit series that was evaluated earlier with the frequency test and the gap 
test will be employed to illustrate the poker test. Each of the seven possible poker hands is 
listed, followed in parentheses by a five digit hand illustrating it, as well as the probability of 
obtaining that hand: a) All five digits different (12345; p = .3024); b) One pair (11234; p = 
.5040); c) Two pair (11223; p = .1080); d) Three of a kind (11123; p = .0720); e) Four of a 
kind (11112; p = .0045); f) Five of a kind (11111; p = .0001); g) Full house (11122; p = 
.0090). Table 10.3 summarizes the chi-square analysis of the data for the 120 digit series. The 
latter series yields only 24 poker hands, since 120/5 = 24. Although, in reality, a much larger 
number of hands should be employed in using the poker test to assess randomness, for purposes 
of demonstration we will employ the 24 available hands. In Table 10.3, the expected number of 
observations for each type of hand was computed by multiplying the total number of hands (24) 
by the probability of that hand occurring. Thus, in the case of the hand All different, the expected 
frequency of 7.2576 was obtained as follows: (24)(.3024) = 7.2576. 


Table 10.3 Summary of Chi-Square Analysis for Poker Test 


                     Observed             Expected 
Cell/Poker Hand      number of hands      number of hands     (O - E)^2 / E 
                         (O)                  (E) 

All different             13                  7.2576              4.54 
One pair                   9                 12.096                .79 
Two pair                   0                  2.592               2.59 
Three of a kind            2                  1.728                .04 
Four of a kind             0                   .108                .11 
Five of a kind             0                   .0024               .00 
Full house                 0                   .216                .22 
Sums                      24                 24                χ² = 8.29 


The degrees of freedom for the chi-square table equals 6, since it is one less than the 
number of cells/categories (which equals 7). In order to reject the null hypothesis, and thus con- 
clude that the distribution is not random, it is required that the computed value of chi-square is 
equal to or greater than the tabled critical value at the prespecified level of significance. 
Employing Table A4 for df = 6, $\chi^2_{.05} = 12.59$ and $\chi^2_{.01} = 16.81$. Since the obtained value 
$\chi^2 = 8.29$ is less than $\chi^2_{.05} = 12.59$, the null hypothesis is retained. Thus, the data are consistent 
with the series being random. 
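
The classification of hands and the chi-square computation can be sketched as follows. The dictionary and function names are assumptions introduced for illustration; the probabilities are those listed in the text, and each hand is identified by the sorted multiplicities of its distinct digits (e.g., the multiplicities 1, 1, 3 identify three of a kind).

```python
from collections import Counter

ROWS = [
    "89372 30231 47856 20968 75307 89635",
    "99846 37910 83761 00389 56674 12036",
    "78889 94533 11160 08773 97520 38604",
    "63028 67001 25057 90864 32589 61078",
]
digits = "".join(ROWS).replace(" ", "")

# Probabilities of the seven hands, as listed in the text.
PROBABILITIES = {
    "All different": 0.3024, "One pair": 0.5040, "Two pair": 0.1080,
    "Three of a kind": 0.0720, "Four of a kind": 0.0045,
    "Five of a kind": 0.0001, "Full house": 0.0090,
}

# Sorted multiplicities of the distinct digits in a hand identify its type.
PATTERNS = {
    (1, 1, 1, 1, 1): "All different", (1, 1, 1, 2): "One pair",
    (1, 2, 2): "Two pair", (1, 1, 3): "Three of a kind",
    (1, 4): "Four of a kind", (5,): "Five of a kind", (2, 3): "Full house",
}


def classify(hand):
    """Return the poker-hand category for a string of five digits."""
    return PATTERNS[tuple(sorted(Counter(hand).values()))]


hands = [digits[i:i + 5] for i in range(0, len(digits), 5)]
observed = Counter(classify(h) for h in hands)
chi_square = sum(
    (observed[name] - len(hands) * p) ** 2 / (len(hands) * p)
    for name, p in PROBABILITIES.items()
)

print(dict(observed))
print(round(chi_square, 2))
```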


Test 10e: The maximum test Described in Gruenberger and Jaffray (1965), the maximum 
test evaluates strings of three consecutive digits and records the number of cases in which the 
middle digit is higher than either of the outside two digits (e.g., in the string 152, the 5 is larger 
than the 1 and 2). Gruenberger and Jaffray (1965) note that the likelihood of the latter occurring 
is .285. The binomial sign test for a single sample can be employed to evaluate the data, with 
n representing the total number of three digit strings analyzed in a sequence, and $\pi_1 = .285$ and 
$\pi_2 = .715$, respectively, representing the likelihood of a hit versus a miss (where a hit is the 
middle digit being greater than the two outside digits). 

The same series of 120 digits that have been evaluated with the other tests described in this 
section will now be evaluated with the maximum test. Beginning with the first digit and moving 
sequentially, 40 three-digit strings can be demarcated. Thus, the first string in the series which 
consists of the digits 893 is a hit, since the middle digit 9 is greater than the two outside digits 
8 and 3. The second string 723 is a miss, since the middle digit 2 is not higher than both of the 
outside digits. Altogether, in 10 of the 40 strings (p = 10/40 = .25) the middle digit is larger 
than the two outside digits. The expected number of strings in which the middle digit will be higher 
than both outside digits is $\mu = (\pi_1)(n) = (.285)(40) = 11.4$. Employing Equation 9.7 (the 
normal approximation of the binomial sign test for a single sample), the value z = -.49 is 
computed. Equation 9.9, the continuity-corrected equation, yields the value z = -.32. The 
negative sign for the z values indicates that the observed number of strings is less than the 
expected number of strings. 


$$z = \frac{x - n\pi_1}{\sqrt{n\pi_1\pi_2}} = \frac{10 - (40)(.285)}{\sqrt{(40)(.285)(.715)}} = -.49$$

Since the absolute values z = .49 and z = .32 are lower than the tabled critical two-tailed 
value $z_{.05} = 1.96$ and the tabled critical one-tailed value $z_{.05} = 1.65$, the null hypothesis cannot 
be rejected, regardless of which alternative hypothesis is employed. The observed value x = 10 
is well within chance expectation. Thus, we cannot conclude that the series is not random. 
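
A minimal Python sketch of the maximum test follows. The function names are hypothetical, and the continuity correction is implemented here as a reduction of the absolute difference by .5, which is an assumption that reproduces the z values reported above rather than a transcription of Equation 9.9.

```python
import math


def count_middle_maxima(digits):
    """Split the digits into consecutive non-overlapping three-digit strings and
    count the strings whose middle digit is larger than both outside digits."""
    n_strings = len(digits) // 3
    hits = 0
    for i in range(n_strings):
        left, middle, right = digits[3 * i: 3 * i + 3]
        if middle > left and middle > right:
            hits += 1
    return hits, n_strings


def maximum_test_z(hits, n_strings, p_hit=0.285, continuity=False):
    """Normal approximation of the binomial sign test for the number of hits."""
    expected = n_strings * p_hit
    standard_deviation = math.sqrt(n_strings * p_hit * (1 - p_hit))
    difference = hits - expected
    if continuity:
        # Assumed form of the correction: shrink |difference| by .5 toward zero.
        difference = math.copysign(max(abs(difference) - 0.5, 0.0), difference)
    return difference / standard_deviation


# The first two strings of the series discussed in the text: 893 (hit), 723 (miss).
print(count_middle_maxima([8, 9, 3, 7, 2, 3]))

# Values reported in the text: 10 hits in 40 strings.
print(round(maximum_test_z(10, 40), 2))
print(round(maximum_test_z(10, 40, continuity=True), 2))
```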

With the exception of the gap test that was conducted on the mean and variance for the 
digit 0, the 120 digit series passed all of the other tests of randomness (i.e., the frequency test, 
the gap test for gap lengths for the entire distribution, the poker test, and the maximum 
test). Note that the tests thus far described in this section are employed, for the most part, to 
evaluate series that are comprised of single digit integers and/or discrete data. The next test that 
will be described is designed to evaluate continuous data (i.e., values that are not necessarily 
whole numbers). 


Test 10f: The mean square successive difference test (The reader should note that this test 
evaluates interval/ratio data.) Attributed to Bellinson et al. (1941) and von Neumann (1941) and 
described in Bennett and Franklin (1954), Chou (1989), and Zar (1999), the mean square 
successive difference test contrasts the mean of the squares of the (n - 1) 
successive differences in a series of n numbers with the variance of the n numbers. Within the 
framework of the test, the mean of the squares of the successive differences is conceptualized as 
an alternative measure of variance that is contrasted with the estimated population variance 
(which is computed with Equation I.5). The mean square successive difference test can be em- 
ployed to evaluate whether or not a sequence of continuous interval/ratio scores is random. Zar 
(1999, p. 586) notes that an assumption of the latter test is that the data are derived from a 
normally distributed population. 

The mean square successive difference test will be employed to evaluate Example 10.2 
(which employs the same data as Example 10.5), which was previously evaluated with the runs 
test for serial randomness (Test 10a). The serial distribution of the 21 scores for Example 10.2 
is presented below. 


1.90, 1.99, 2.00, 1.78, 1.77, 1.76, 1.98, 1.90, 1.65, 1.76, 2.01, 1.78, 1.99, 1.76, 1.94, 1.78, 1.67, 
1.87, 1.91, 1.91, 1.89 

The initial value that is computed in conducting the mean square successive difference 
test is the estimated population variance. Employing Equation I.5, the latter value is computed 
to be $\tilde{s}^2 = .0122$ for the n = 21 scores in the series (the mean score in the series is $\bar{X} = 1.857$). 
Equation 10.8 is now employed to compute the mean of the squares of the successive 
differences (more specifically, the unbiased estimate of that value in the population). The latter 
value will be represented by the symbol $\tilde{s}^2_{ms}$. 


$$\tilde{s}^2_{ms} = \frac{\sum_{i=1}^{n-1}(X_{i+1} - X_i)^2}{2(n - 1)} \quad \text{(Equation 10.8)}$$


The numerator of Equation 10.8 indicates that each score in the series is subtracted from 
the score that comes after it, each of the difference scores is squared, and the squared difference 
scores are summed. The sum of the squared difference scores is divided by 2(n - 1), yielding 
the value of $\tilde{s}^2_{ms}$. The computation of the value $\tilde{s}^2_{ms} = .0128$ is demonstrated below. 


$$\tilde{s}^2_{ms} = \frac{(1.99 - 1.90)^2 + (2.00 - 1.99)^2 + \cdots + (1.91 - 1.91)^2 + (1.89 - 1.91)^2}{2(21 - 1)} = \frac{.5107}{40} = .0128$$

Equation 10.9 is employed to compute the test statistic, which Young (1941) designated as 
C. Employing the latter equation, the value C = -.049 is computed. Note that the value of C will 
be positive when $\tilde{s}^2_{ms} < \tilde{s}^2$, negative when $\tilde{s}^2_{ms} > \tilde{s}^2$, and equal to 0 when the two values are 
equal. 


$$C = 1 - \frac{\tilde{s}^2_{ms}}{\tilde{s}^2} = 1 - \frac{.0128}{.0122} = -.049 \quad \text{(Equation 10.9)}$$


In order to reject the null hypothesis and conclude that the distribution is nonrandom, the 
absolute value of C must be equal to or greater than the tabled critical value of the C statistic at 
the prespecified level of significance. A large absolute C value indicates there is a large 
discrepancy between the values $\tilde{s}^2$ and $\tilde{s}^2_{ms}$. A table of critical values can be found in Zar (1999), 
who developed his table based on the work of Young (1941). Another table of the sampling 
distribution for this test (although not C values) was derived by Hart (1942), and can be found 
in Bennett and Franklin (1954). The computed absolute value C = .049 is less than Zar's (1999) 
tabled critical values $C_{.05} = .343$ and $C_{.01} = .470$. We thus retain the null hypothesis. In other 
words, the evidence does not indicate the distribution is not random. 

In lieu of the exact table for the sampling distribution of C, a large sample normal 
approximation for the C statistic can be employed, which is computed with Equation 10.10. 
Employing the latter equation, the value z = -.24 is computed. 


$$z = \frac{C}{\sqrt{\dfrac{n - 2}{n^2 - 1}}} = \frac{-.049}{\sqrt{\dfrac{21 - 2}{(21)^2 - 1}}} = -.24 \quad \text{(Equation 10.10)}$$

Zar (1999) notes that one-tailed probabilities should be employed for the above analysis. 
To reject the null hypothesis and conclude that the distribution is nonrandom, the absolute value 
of z must be equal to or greater than the tabled critical value at the prespecified level of 
significance. Since the absolute value z = .24 is less than the tabled critical one-tailed value 
$z_{.05} = 1.65$, the null hypothesis cannot be rejected. Thus, the evidence does not indicate the 
distribution is nonrandom. This is consistent with the conclusion that was reached when the 
same set of data was evaluated with the runs test for serial randomness. 
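
The full sequence of computations for the mean square successive difference test (Equations 10.8 through 10.10) can be sketched in Python as shown below. The function name is an assumption made for illustration; the data are the 21 scores of Example 10.2.

```python
import math

# The 21 scores of Example 10.2, in serial order.
scores = [1.90, 1.99, 2.00, 1.78, 1.77, 1.76, 1.98, 1.90, 1.65, 1.76, 2.01,
          1.78, 1.99, 1.76, 1.94, 1.78, 1.67, 1.87, 1.91, 1.91, 1.89]


def mean_square_successive_difference_test(x):
    """Return (estimated population variance, mean square successive
    difference, C, z) for a series of interval/ratio scores."""
    n = len(x)
    mean = sum(x) / n
    # Unbiased estimate of the population variance (divisor n - 1).
    variance = sum((value - mean) ** 2 for value in x) / (n - 1)
    # Equation 10.8: squared successive differences divided by 2(n - 1).
    ms_difference = sum((x[i + 1] - x[i]) ** 2 for i in range(n - 1)) / (2 * (n - 1))
    # Equation 10.9.
    c = 1 - ms_difference / variance
    # Equation 10.10: large sample normal approximation.
    z = c / math.sqrt((n - 2) / (n ** 2 - 1))
    return variance, ms_difference, c, z


variance, ms_difference, c_statistic, z_value = mean_square_successive_difference_test(scores)
print(round(variance, 4), round(ms_difference, 4), round(c_statistic, 3), round(z_value, 2))
```

Since the code does not round the two variance estimates before forming their ratio, its value of C may differ from the reported -.049 in the third decimal place; the value of z agrees with the reported -.24 to two decimal places.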


Additional tests of randomness 

1) Autocorrelation (also known as serial correlation) This procedure, which is discussed 
in detail in Section VII of the Pearson product-moment correlation coefficient (Test 28), can 
be employed with series in which, in each trial, there are two or more possible categorical 
outcomes, or with continuous serial data (such as the data that were evaluated with the runs test for 
serial randomness and the mean square successive difference test). Within the framework of 
autocorrelation, one can conclude that a series is random if the correlation coefficient between 
successive numbers in the series is equal to zero. The Durbin-Watson test (1950, 1951, 1971) 
is one of a number of procedures that are employed for autocorrelation. The latter test is 
described in sources such as Chou (1989), Montgomery and Peck (1992), and Netter et al. 
(1988). 
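
As a rough illustration of the idea of correlating successive numbers in a series, the following Python sketch computes a lag 1 correlation by pairing each score with the score that follows it and computing a Pearson correlation coefficient for the pairs. The function name and this particular way of forming the pairs are assumptions made for illustration; they are not the procedure of Test 28 or the Durbin-Watson test.

```python
import math


def lag1_autocorrelation(series):
    """Pearson correlation between each value and the value that follows it."""
    x = series[:-1]   # values 1 through n - 1
    y = series[1:]    # values 2 through n
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# A strictly alternating series is perfectly negatively correlated at lag 1,
# whereas a steadily increasing series is perfectly positively correlated.
print(lag1_autocorrelation([1, 2, 1, 2, 1, 2]))
print(lag1_autocorrelation([1, 2, 3, 4, 5]))
```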

2) The coupon collector's test The coupon collector's test evaluates the number of digits 
required to make a complete set, which consists of all the integer values 1 to k. Mosteller (1965, 
p. 15) notes that in a random series, the average number of digits required (i.e., the expected 
value) to obtain a set that includes all k digits can be computed through use of the harmonic series 
k[1/k + 1/(k-1) + 1/(k-2) + ... + 1/2 + 1]. Thus, if the integers 1 through 5 are employed in a 
series of random numbers, the expected number of trials that will be required to have a set that 
includes all five digits is 5(1/5 + 1/4 + 1/3 + 1/2 + 1) = 11.42. The average number of digits can 
also be approximated quite accurately by employing the following equation: k ln k + .577k + 1/2. 
When the latter equation is solved for k = 5 digits, it results in the value 11.43 (i.e., 5 ln 5 + 
(.577)(5) + 1/2 = 11.43). Within the framework of the coupon collector's test, inferential 
statistical procedures can be employed to do the following: a) Employ the single-sample z test 
to contrast the predicted/expected average number of digits (based on the equation noted above) 
with the observed average number of digits required for all of the complete sets comprised of k 
digits in the series; and b) Employ the chi-square goodness-of-fit test to evaluate the expected 
versus observed frequencies for categories that represent different values for the number of digits 
required to comprise a full set of k digits in the series. The algorithm for the coupon collector's 
test for computing expected probabilities (required for computing the expected frequencies) is 
described in Knuth (1969, 1981, 1997). 
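
The two expressions for the expected number of digits required to obtain a complete set can be compared with a short Python sketch. The function names are hypothetical; the formulas are the harmonic series expression and the approximation stated above.

```python
import math


def expected_digits_exact(k):
    """Expected number of digits needed to observe all k values:
    k[1/k + 1/(k-1) + ... + 1/2 + 1]."""
    return k * sum(1 / j for j in range(1, k + 1))


def expected_digits_approximate(k):
    """Approximation k ln k + .577k + 1/2 quoted in the text."""
    return k * math.log(k) + 0.577 * k + 0.5


for k in (5, 10):
    print(k, round(expected_digits_exact(k), 2), round(expected_digits_approximate(k), 2))
```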

3) The serial test The serial test evaluates the occurrence of each two-digit combination 
ranging from 00 to kk, where k is the largest digit that can occur in the series. The serial test, 
which can be generalized to groups consisting of combinations of three or more digits, is 
described in more detail in Emshoff and Sisson (1970). 

4) The $d^2$ test of random numbers Described in Gruenberger and Jaffray (1965), the 
$d^2$ test of random numbers conceptualizes random numbers as coordinates on a graph, and 
addresses the following question: If two points are chosen at random within a one unit square 
grid, what is the likelihood that the distance between the two points is greater than a specific 
value (e.g., that the value of d is greater than .5 and thus $d^2$ is greater than .25)? The $d^2$ test of 
random numbers is a stringent test that can be employed to evaluate random numbers that are 
in the range 0 to 1, since each point is represented by two numbers that fall in the latter range of 
values. 
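
The question posed by the $d^2$ test can be explored with a simple simulation. The Python sketch below is an assumption-laden illustration rather than the procedure of Gruenberger and Jaffray (1965): it uses Python's own random number generator (with a fixed, arbitrarily chosen seed) to estimate the proportion of pairs of points in the unit square for which $d^2$ exceeds a threshold. The function name and the number of simulated pairs are hypothetical choices.

```python
import random


def estimate_probability_d2_exceeds(threshold=0.25, n_pairs=100_000, seed=0):
    """Estimate the likelihood that the squared distance between two points
    chosen at random in the unit square is greater than the threshold."""
    generator = random.Random(seed)
    exceed_count = 0
    for _ in range(n_pairs):
        # Each point is represented by two random numbers in the range 0 to 1.
        x1, y1 = generator.random(), generator.random()
        x2, y2 = generator.random(), generator.random()
        d_squared = (x2 - x1) ** 2 + (y2 - y1) ** 2
        if d_squared > threshold:
            exceed_count += 1
    return exceed_count / n_pairs


estimated_probability = estimate_probability_d2_exceeds()
print(estimated_probability)
```

In an actual application, the observed proportion for the series being evaluated would be compared with the theoretical likelihood for the threshold in question.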

5) Tests of trend analysis/time series analysis Economists often refer to a set of ob- 
servations that are measured over a period of time as a time series. The pattern of the data in a 
time series may be random or may instead be characterized by patterns or trends. Trend analysis 
and time series analysis are terms that are used to describe a variety of statistical procedures 
(such as those that have been described in this section) which are employed for analyzing such 
data. Among the other tests that are used for trend analysis that can be employed to identify a 
nonrandom series is the Cox-Stuart test for trend (developed by Cox and Stuart (1955) and 
described in Conover (1999) and Daniel (1988)), which is a modification of the binomial sign 
test for a single sample. Other tests of time series and trend analysis are commonly described 
in books on business and economic statistics (e.g., Chou (1989), Hoel and Jessen (1982), 
Montgomery and Peck (1992), and Netter et al. (1988)). 

Endnotes 


1. An alternate definition of randomness employed by some sources is that in a random series, 
each of k possible alternatives is equally likely to occur on any trial, and that the outcome 
on each trial is independent of the outcome on any other trial. The problem with the latter 
definition is that it cannot be applied to a series in which on each trial there are two or more 
alternatives which do not have an equal likelihood of occurring (the stipulation regarding 
independence does, however, also apply to a series involving alternatives that do not have 
an equal likelihood of occurring on each trial). In point of fact, it is possible to apply the 
concept of randomness to a series in which $\pi_1 \neq \pi_2$. To illustrate the latter, consider the 
following example. Assume we have a series consisting of N trials involving a binomially 
distributed variable for which there are two possible outcomes A and B. The theoretical 
probabilities in the underlying population for each of the outcomes are $\pi_A = .75$ and 
$\pi_B = .25$. If a series involving the two alternatives is in fact random, on each trial the 
respective likelihoods of alternative A versus alternative B occurring will not be 
$\pi_A = \pi_B = .5$, but instead will be $\pi_A = .75$ and $\pi_B = .25$. If such a series is random 
it is expected that alternative A will occur approximately 75% of the time and alternative 
B will occur approximately 25% of the time. However, it is important to note that 
one cannot conclude that the above series is random purely on the basis of the relative 
frequencies of the two alternatives. To illustrate this, consider the following series 
consisting of 28 trials which is characterized by the presence of an invariant pattern: 
AAABAAABAAABAAABAAABAAABAAAB. If one is attempting to predict the out- 
come on the 29th trial, and if, in fact, the periodicity of the pattern that is depicted is 
invariant, the likelihood that alternative A will occur on the next trial is not .75, but is, in 
fact, 1. This is the case, since the occurrence of events in the series can be summarized by 
the simple algorithm that the series is comprised of 4 trial cycles, and within each cycle 
alternative A occurs on the first 3 trials and alternative B on the fourth trial. The point to 
be made here is that it is entirely possible to have a random series, even if each of the 
alternatives is not equally likely to occur on every trial. However, if the occurrence of the 
alternatives is consistent with their theoretical frequencies, the latter in and of itself does 
not insure that the series is random. 


2. It should be pointed out that, in actuality, each of the three series depicted in Figure 10.1 
has an equal likelihood of occurring. However, in most instances where a consistent pattern 
is present that persists over a large number of trials, such a pattern is more likely to be 
attributed to a nonrandom factor than it is to chance. 


3. The computation of the values in Table A8 is based on the following logic. If a series 
consists of N trials and alternative 1 occurs $n_1$ times and alternative 2 occurs $n_2$ times, the 
number of possible combinations involving alternative 1 occurring $n_1$ times and alternative 
2 occurring $n_2$ times will be $M = N!/(n_1!\,n_2!)$. Thus, if a coin is tossed N = 4 times, 
since M = 4!/(2! 2!) = 6, there will be 6 possible ways of obtaining $n_1 = 2$ Heads and 
$n_2 = 2$ Tails. Specifically, the 6 ways of obtaining 2 Heads and 2 Tails are: HHTT, 
TTHH, THHT, HTTH, THTH, HTHT. Each of the 6 aforementioned sequences constitutes 
a series, and the likelihood of each of the series occurring is equal. The two series HHTT 
and TTHH are comprised of 2 runs, the two series THHT and HTTH are comprised of 3 
runs, and the two series THTH and HTHT are comprised of 4 runs. Thus, the likelihood 
of observing 2 runs will equal 2/6 = .33, the likelihood of observing 3 runs will equal 2/6 
= .33, and the likelihood of observing 4 runs will equal 2/6 = .33. The likelihood of 
observing 3 or more runs will equal .67, and the likelihood of observing 2 or more runs will 
equal 1. A thorough discussion of the derivation of the sampling distribution for the single- 
sample runs test, which is attributed in some sources to Wald and Wolfowitz (1940), is 
presented in Hogg and Tanis (1988). 
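
The enumeration described in this endnote can be reproduced with a brief Python sketch. The function names are hypothetical, and the brute-force enumeration is only practical for very small values of N.

```python
from collections import Counter
from itertools import permutations


def count_runs(sequence):
    """Number of runs in a sequence: one plus the number of changes."""
    return 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)


def runs_distribution(n_heads, n_tails):
    """Tabulate the number of runs in every distinct ordering of the outcomes."""
    distinct_series = set(permutations("H" * n_heads + "T" * n_tails))
    return Counter(count_runs(series) for series in distinct_series)


distribution = runs_distribution(2, 2)
total_series = sum(distribution.values())
for runs in sorted(distribution):
    print(runs, "runs:", distribution[runs], "of", total_series, "series")
```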


4. Some of the cells in Table A8 only list a lower limit. For the sample sizes in question, 
there is no maximum number of runs (upper limit) that will allow the null hypothesis to be 
rejected. 

5. A general discussion of the correction for continuity can be found under the Wilcoxon 
signed-ranks test (Test 6). The reader should take note of the fact that the correction for 
continuity described in this section is intended to provide a more conservative test of the 
null hypothesis (i.e., make it more difficult to reject). However, when the absolute value 
of the numerator of Equation 10.1 is equal to or very close to zero, the z value computed 
with Equation 10.2 will be further removed from zero than the z value computed with 
Equation 10.1. Since the continuity-corrected z value will be extremely close to zero, this 
result is of no practical consequence (i.e., the null hypothesis will still be retained). Zar 
(1999, p. 493), however, notes that in actuality the correction for continuity should not be 
applied if it increases rather than decreases the absolute value of the test statistic. This 
observation regarding the correction for continuity can be generalized to the continuity- 
correction described in the book for other nonparametric tests. 


6. The term Monte Carlo derives from the fact that Ulam had an uncle with a predilection for 
gambling who often frequented the casinos at Monte Carlo. 


7. The application of the single-sample runs test to a design involving two independent 
samples is described in Siegel (1956) under the Wald-Wolfowitz (1940) runs test. 


8. A number is prime if it has no divisors except for itself and the value 1. In other words, if 
a prime number is divided by any number except itself or 1, it will yield a remainder. 


9. The author is indebted to Theodore Sheskin for providing some of the reference material 
employed in this section. 


10. Although Equation 10.2 (the continuity-corrected equation for the single-sample runs test) 
yields a slightly smaller absolute z value for this example and the example to follow, it leads 
to identical conclusions with respect to the null hypothesis. 


11. Gruenberger and Jaffray (1965) note that an even more stringent variant of the poker test 
employs digits 2 through 6 as the second hand, digits 3 through 7 as the third hand, digits 
4 through 8 as the fourth hand, and so on. The analysis is carried on until the end of the 
series (which will be the point at which a five-digit hand is no longer possible). The total 
of (n — 4) possible hands can be evaluated with the chi-square goodness-of-fit test. 
However, since within the format just described the hands are not actually independent of 
one another (because they contain overlapping data), the assumption of independence for 
the chi-square goodness-of-fit test is violated. 


12. a) Phillips et al. (1976), and Schmidt and Taylor (1970) describe the computation of the 
probabilities that are listed for the poker test for a five digit hand; b) Although it is 
generally employed with groups of five digits, the poker test can be applied to groups that 
consist of more or less than five digits. The poker test probabilities (for k = 10 digits) for 
a four digit hand (Schmidt and Taylor (1970)) and a three digit hand (Banks and Carson 
(1984)), along with a sample hand, are as follows: Four digit hand: All four digits different 
(1234; p = .504); One pair (1123; p = .432); Two pair (1122; p = .027); Three 
of a kind (1112; p = .036); Four of a kind (1111; p = .001). Three digit hand: All three 
digits different (123; p = .72); One pair (112; p = .27); Three of a kind (111; p = .01). 
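
The poker test probabilities listed in this endnote (and those for a five digit hand) can be checked by brute-force enumeration. The Python sketch below is an illustration only; the function name is hypothetical, and each hand is classified by the sorted counts of its repeated digits (for example, the pattern (2, 1, 1) corresponds to One pair in a four digit hand).

```python
from collections import Counter
from itertools import product


def poker_pattern_probabilities(hand_size, k=10):
    """Enumerate all k**hand_size equally likely hands and return the
    probability of each repetition pattern."""
    pattern_counts = Counter()
    for hand in product(range(k), repeat=hand_size):
        # E.g., the hand 1, 1, 2, 3 has digit counts 2, 1, 1 (One pair).
        pattern = tuple(sorted(Counter(hand).values(), reverse=True))
        pattern_counts[pattern] += 1
    total_hands = k ** hand_size
    return {pattern: count / total_hands for pattern, count in pattern_counts.items()}


for size in (3, 4, 5):
    probabilities = poker_pattern_probabilities(size)
    for pattern in sorted(probabilities):
        print(size, pattern, probabilities[pattern])
```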
Inferential Statistical Tests Employed 
with Two Independent Samples 
(and Related Measures of 
Association/Correlation) 


Test 11: The t Test for Two Independent Samples 
Test 12: The Mann-Whitney U Test 
Test 13: The Kolmogorov-Smirnov Test for Two Independent Samples 
Test 14: The Siegel-Tukey Test for Equal Variability 
Test 15: The Moses Test for Equal Variability 
Test 16: The Chi-Square Test for r x c Tables 


Test 11 


The t Test for Two Independent Samples 
(Parametric Test Employed with Interval/Ratio Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Do two independent samples represent two populations with 
different mean values? 


Relevant background information on test The t test for two independent samples, which 
is employed in a hypothesis testing situation involving two independent samples, is one of a 
number of inferential statistical tests that are based on the t distribution (which is discussed in 
detail under the single-sample t test (Test 2)). Two or more samples are independent of one 
another if each of the samples is comprised of different subjects. In addition to being referred 
to as an independent samples design, a design involving two or more independent samples is 
also referred to as a between-subjects design, a between-groups design, and a 
randomized-groups design. In order to eliminate the possibility of confounding in an independent samples 
design, each subject should be randomly assigned to one of the k (where $k \geq 2$) experimental 
conditions. 

In conducting the t test for two independent samples, the two sample means (represented 
by the notations $\bar{X}_1$ and $\bar{X}_2$) are employed to estimate the values of the means of the populations 
($\mu_1$ and $\mu_2$) from which the samples are derived. If the result of the t test for two independent 
samples is significant, it indicates the researcher can conclude there is a high likelihood that the 
samples represent populations with different mean values. It should be noted that the t test for 
two independent samples is the appropriate test to employ for contrasting the means of two 
independent samples when the values of the underlying population variances are unknown. In 
instances where the latter two values are known, the appropriate test to employ is the z test for 
two independent samples (Test 11d), which is described in Section VI. 

The t test for two independent samples is employed with interval/ratio data, and is based 
on the following assumptions: a) Each sample has been randomly selected from the population 
it represents; b) The distribution of data in the underlying population from which each of the 
samples is derived is normal; and c) The third assumption, which is referred to as the 
homogeneity of variance assumption, states that the variance of the underlying population 
represented by Sample 1 is equal to the variance of the underlying population represented by 
Sample 2 (i.e., $\sigma_1^2 = \sigma_2^2$). The homogeneity of variance assumption is discussed in detail in 
Section VI. If any of the aforementioned assumptions are saliently violated, the reliability of the t 
test statistic may be compromised. 


II. Example 
Example 11.1 In order to assess the efficacy of a new antidepressant drug, ten clinically 
depressed patients are randomly assigned to one of two groups. Five patients are assigned to 
Group 1, which is administered the antidepressant drug for a period of six months. The other five 
patients are assigned to Group 2, which is administered a placebo during the same six-month 
period. Assume that prior to introducing the experimental treatments, the experimenter 
confirmed that the level of depression in the two groups was equal. After six months elapse, all ten 
subjects are rated by a psychiatrist (who is blind with respect to a subject's experimental 
condition) on their level of depression. The psychiatrist's depression ratings for the five subjects 
in each group follow (the higher the rating the more depressed a subject): Group 1: 11, 1, 0, 
2, 0; Group 2: 11, 11, 5, 8, 4. Do the data indicate that the antidepressant drug is effective? 


III. Null versus Alternative Hypotheses 


Null hypothesis $H_0: \mu_1 = \mu_2$ 


(The mean of the population Group 1 represents equals the mean of the population Group 2 
represents.) 


Alternative hypothesis $H_1: \mu_1 \neq \mu_2$ 


(The mean of the population Group 1 represents does not equal the mean of the population Group 
2 represents. This is a nondirectional alternative hypothesis and it is evaluated with a two- 
tailed test. In order to be supported, the absolute value of t must be equal to or greater than the 
tabled critical two-tailed t value at the prespecified level of significance. Thus, either a 
significant positive t value or a significant negative t value will provide support for this 
alternative hypothesis.) 


Or 

$H_1: \mu_1 > \mu_2$ 
(The mean of the population Group 1 represents is greater than the mean of the population Group 
2 represents. This is a directional alternative hypothesis and it is evaluated with a one-tailed 
test. It will only be supported if the sign of t is positive, and the absolute value of t is equal to 
or greater than the tabled critical one-tailed t value at the prespecified level of significance.) 


Or 

$H_1: \mu_1 < \mu_2$ 
(The mean of the population Group 1 represents is less than the mean of the population Group 
2 represents. This is a directional alternative hypothesis and it is evaluated with a one-tailed 
test. It will only be supported if the sign of t is negative, and the absolute value of t is equal to 
or greater than the tabled critical one-tailed t value at the prespecified level of significance.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 


The data for Example 11.1 are summarized in Table 11.1. In the example there are $n_1 = 5$ 
subjects in Group 1 and $n_2 = 5$ subjects in Group 2. In Table 11.1 each subject is identified by 
a two digit number. The first digit before the comma indicates the subject's number within the 
group, and the second digit indicates the group identification number. Thus, Subject i,j is the ith 
subject in Group j. The scores of the 10 subjects are listed in the columns of Table 11.1 labelled 
$X_1$ and $X_2$. The adjacent columns labelled $X_1^2$ and $X_2^2$ contain the square of each subject's score. 

Table 11.1 Data for Example 11.1 

              Group 1                            Group 2 
                $X_1$    $X_1^2$                     $X_2$    $X_2^2$ 
Subject 1,1      11      121      Subject 1,2      11      121 
Subject 2,1       1        1      Subject 2,2      11      121 
Subject 3,1       0        0      Subject 3,2       5       25 
Subject 4,1       2        4      Subject 4,2       8       64 
Subject 5,1       0        0      Subject 5,2       4       16 
       $\sum X_1 = 14$   $\sum X_1^2 = 126$          $\sum X_2 = 39$   $\sum X_2^2 = 347$ 


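Employing Equations I.1 and I.5, the mean and estimated population variance for each 
sample are computed below. 

$$\bar{X}_1 = \frac{\sum X_1}{n_1} = \frac{14}{5} = 2.8 \qquad \tilde{s}_1^2 = \frac{\sum X_1^2 - \dfrac{(\sum X_1)^2}{n_1}}{n_1 - 1} = \frac{126 - \dfrac{(14)^2}{5}}{5 - 1} = 21.7$$

$$\bar{X}_2 = \frac{\sum X_2}{n_2} = \frac{39}{5} = 7.8 \qquad \tilde{s}_2^2 = \frac{\sum X_2^2 - \dfrac{(\sum X_2)^2}{n_2}}{n_2 - 1} = \frac{347 - \dfrac{(39)^2}{5}}{5 - 1} = 10.7$$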
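
The means and estimated population variances can be verified with the following Python sketch, which applies the same computational formula (sum of squared scores minus the squared sum divided by n, all divided by n - 1). The function name is an assumption introduced for illustration.

```python
group_1 = [11, 1, 0, 2, 0]    # antidepressant drug
group_2 = [11, 11, 5, 8, 4]   # placebo


def mean_and_estimated_variance(scores):
    """Return the sample mean and the unbiased estimate of the population
    variance, using the sums of X and X squared."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x_squared = sum(x * x for x in scores)
    mean = sum_x / n
    variance = (sum_x_squared - sum_x ** 2 / n) / (n - 1)
    return mean, variance


mean_1, variance_1 = mean_and_estimated_variance(group_1)
mean_2, variance_2 = mean_and_estimated_variance(group_2)
print(mean_1, round(variance_1, 1))
print(mean_2, round(variance_2, 1))
```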


When there are an equal number of subjects in each sample, Equation 11.1 can be 
employed to compute the test statistic for the t test for two independent samples. 

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\tilde{s}_1^2}{n_1} + \dfrac{\tilde{s}_2^2}{n_2}}} \quad \text{(Equation 11.1)}$$

Employing Equation 11.1, the value t = -1.96 is computed. 

$$t = \frac{2.8 - 7.8}{\sqrt{\dfrac{21.7}{5} + \dfrac{10.7}{5}}} = -1.96$$

Equation 11.2 is an alternative way of expressing Equation 11.1. 

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_{\bar{X}_1}^2 + s_{\bar{X}_2}^2}} \quad \text{(Equation 11.2)}$$

Note that in Equation 11.2, the values $s_{\bar{X}_1}^2$ and $s_{\bar{X}_2}^2$ represent the squares of the standard 
error of the means of the two groups. Employing the square of the value computed with 
Equation 2.2 (presented in Section IV of the single-sample t test), the squared standard error of the 
means of the two samples are computed: $s_{\bar{X}_1}^2 = \tilde{s}_1^2/n_1 = 21.7/5 = 4.34$ and 
$s_{\bar{X}_2}^2 = \tilde{s}_2^2/n_2 = 10.7/5 = 2.14$. When the values $s_{\bar{X}_1}^2 = 4.34$ and $s_{\bar{X}_2}^2 = 2.14$ are substituted in 
Equation 11.2, they yield the value t = -1.96: $t = (2.8 - 7.8)/\sqrt{4.34 + 2.14} = -1.96$. 

The reader should take note of the fact that the values $\tilde{s}_1^2$, $\tilde{s}_2^2$, $s_{\bar{X}_1}^2$, and $s_{\bar{X}_2}^2$ (all of which are 
estimates of either the variance of a population or the variance of a sampling distribution) can 
never be negative numbers. If a negative value is obtained for any of the aforementioned values, 
it indicates a computational error has been made. 
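
Equations 11.1 and 11.2 can be sketched in Python as follows. The function name is hypothetical; the check for negative variance estimates simply mirrors the point made above that such a value signals a computational error.

```python
import math


def t_from_means_and_variances(mean_1, variance_1, n_1, mean_2, variance_2, n_2):
    """Compute t with Equations 11.1/11.2 (appropriate when n_1 equals n_2).
    Returns (t, standard error of the difference)."""
    # Squared standard error of each mean (the terms of Equation 11.2).
    squared_se_1 = variance_1 / n_1
    squared_se_2 = variance_2 / n_2
    for value in (variance_1, variance_2, squared_se_1, squared_se_2):
        if value < 0:
            raise ValueError("A variance estimate cannot be negative.")
    se_difference = math.sqrt(squared_se_1 + squared_se_2)
    return (mean_1 - mean_2) / se_difference, se_difference


# Summary values for Example 11.1.
t_value, se_difference = t_from_means_and_variances(2.8, 21.7, 5, 7.8, 10.7, 5)
print(round(t_value, 2), round(se_difference, 2))
```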

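Equation 11.3 is a general equation for the t test for two independent samples that can be 
employed for both equal and unequal sample sizes (when $n_1 = n_2$, Equation 11.3 becomes 
equivalent to Equations 11.1/11.2). 

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\left[\dfrac{(n_1 - 1)\tilde{s}_1^2 + (n_2 - 1)\tilde{s}_2^2}{n_1 + n_2 - 2}\right]\left[\dfrac{1}{n_1} + \dfrac{1}{n_2}\right]}} \quad \text{(Equation 11.3)}$$

In the case of Example 11.1, Equation 11.3 yields the identical value t = -1.96 obtained 
with Equations 11.1/11.2. 

$$t = \frac{2.8 - 7.8}{\sqrt{\left[\dfrac{(5 - 1)(21.7) + (5 - 1)(10.7)}{5 + 5 - 2}\right]\left[\dfrac{1}{5} + \dfrac{1}{5}\right]}} = -1.96$$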











The left element inside the radical of the denominator of Equation 11.3 represents a 
weighted average (based on the values of $n_1$ and $n_2$) of the estimated population variances of the 
two groups. This weighted average is referred to as a pooled variance estimate, represented by 
the notation $\tilde{s}_p^2$. Thus: $\tilde{s}_p^2 = [(n_1 - 1)\tilde{s}_1^2 + (n_2 - 1)\tilde{s}_2^2]/(n_1 + n_2 - 2)$. It should be noted 
that if Equations 11.1/11.2 are applied to data where $n_1 \neq n_2$, the absolute value of t will be 
slightly higher than the value computed with Equation 11.3. Thus, use of Equations 11.1/11.2 
when $n_1 \neq n_2$ makes it easier to reject the null hypothesis, and consequently inflates the 
likelihood of committing a Type I error. The application of Equations 11.1-11.3 to a set of data 
when $n_1 \neq n_2$ is illustrated in Section VII. 
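
A minimal Python sketch of Equation 11.3 is given below. The function name is an assumption; the code computes the pooled variance estimate from the raw scores and can be applied to samples of equal or unequal size.

```python
import math


def pooled_variance_t(scores_1, scores_2):
    """Compute t with Equation 11.3. Returns (t, pooled variance estimate)."""
    n_1, n_2 = len(scores_1), len(scores_2)
    mean_1 = sum(scores_1) / n_1
    mean_2 = sum(scores_2) / n_2
    # Unbiased estimates of the two population variances.
    variance_1 = sum((x - mean_1) ** 2 for x in scores_1) / (n_1 - 1)
    variance_2 = sum((x - mean_2) ** 2 for x in scores_2) / (n_2 - 1)
    # Weighted average of the two variance estimates.
    pooled_variance = ((n_1 - 1) * variance_1 + (n_2 - 1) * variance_2) / (n_1 + n_2 - 2)
    standard_error = math.sqrt(pooled_variance * (1 / n_1 + 1 / n_2))
    return (mean_1 - mean_2) / standard_error, pooled_variance


# Example 11.1: with equal sample sizes the result matches Equations 11.1/11.2.
pooled_t, pooled_variance = pooled_variance_t([11, 1, 0, 2, 0], [11, 11, 5, 8, 4])
print(round(pooled_t, 2), round(pooled_variance, 1))
```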

Regardless of which equation is employed, the denominator of the t test for two 
independent samples is referred to as the standard error of the difference. This latter value, which 
can be summarized with the notation $s_{\bar{X}_1 - \bar{X}_2}$, represents an estimated standard deviation of 
difference scores for two populations. Thus, in Example 11.1, $s_{\bar{X}_1 - \bar{X}_2} = 2.55$. If $s_{\bar{X}_1 - \bar{X}_2}$ is 
employed as the denominator of the equation for the t test for two independent samples, the 
equation can be written as follows: $t = (\bar{X}_1 - \bar{X}_2)/s_{\bar{X}_1 - \bar{X}_2}$. 

It should be noted that in some sources the numerator of Equations 11.1-11.3 is written as 
follows: $[(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)]$. The latter notation is only necessary if in stating the null 
hypothesis, a researcher stipulates that the difference between $\mu_1$ and $\mu_2$ is some value other than 
zero. When the null hypothesis is $H_0: \mu_1 = \mu_2$, the value $(\mu_1 - \mu_2)$ reduces to zero, leaving the 
term $(\bar{X}_1 - \bar{X}_2)$ as the numerator of the t test equation. The application of the t test for two 
independent samples to a hypothesis testing situation in which a value other than zero is 
stipulated in the null hypothesis is illustrated with Example 11.2 in Section VI. 


V. Interpretation of the Test Results 
The obtained value t = -1.96 is evaluated with Table A2 (Table of Student's t Distribution) 
in the Appendix. The degrees of freedom for the t test for two independent samples are 
computed with Equation 11.4. 


© 2000 by Chapman & Hall/CRC 


$$df = n_1 + n_2 - 2 \quad \text{(Equation 11.4)}$$


Employing Equation 11.4, the value df = 5 + 5 - 2 = 8 is computed. Thus, the tabled 
critical t values that are employed in evaluating the results of Example 11.1 are the values 
recorded in the cells of Table A2 that fall in the row for df = 8, and the columns with 
probabilities that correspond to the two-tailed and one-tailed .05 and .01 values. (The protocol 
for employing Table A2 is described in Section V of the single-sample t test.) The critical t 
values for df = 8 are summarized in Table 11.2. 


Table 11.2 Tabled Critical .05 and .01 t Values for df = 8

                     $t_{.05}$   $t_{.01}$
Two-tailed values      2.31        3.36
One-tailed values      1.86        2.90
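
The tabled values can be cross-checked numerically. The following is a minimal sketch, not the procedure used to build Table A2: it integrates the Student t density with Simpson's rule and locates each critical value by bisection. The function names (`t_upper_tail`, `t_critical`) and the step counts are illustrative assumptions.

```python
import math

def t_upper_tail(t, df, steps=2000):
    """P(T >= t) for t >= 0, by Simpson's rule on the Student t density."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: coef * (1 + x * x / df) ** (-(df + 1) / 2)
    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    # Half of the distribution lies above zero; subtract the area on [0, t].
    return 0.5 - total * h / 3

def t_critical(upper_tail, df):
    """Value c with P(T >= c) = upper_tail, located by bisection on [0, 50]."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > upper_tail:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

df = 5 + 5 - 2  # Equation 11.4
for alpha in (.05, .01):
    two_tailed = t_critical(alpha / 2, df)  # alpha is split across both tails
    one_tailed = t_critical(alpha, df)      # alpha sits entirely in one tail
    print(alpha, round(two_tailed, 2), round(one_tailed, 2))
```

Rounded to two decimal places, the computed values agree with the entries of Table 11.2.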


The following guidelines are employed in evaluating the null hypothesis for the t test for
two independent samples.

a) If the nondirectional alternative hypothesis $H_1\colon \mu_1 \neq \mu_2$ is employed, the null hypothesis
can be rejected if the obtained absolute value of t is equal to or greater than the tabled critical
two-tailed value at the prespecified level of significance.

b) If the directional alternative hypothesis $H_1\colon \mu_1 > \mu_2$ is employed, the null hypothesis
can be rejected if the sign of t is positive, and the value of t is equal to or greater than the tabled
critical one-tailed value at the prespecified level of significance.

c) If the directional alternative hypothesis $H_1\colon \mu_1 < \mu_2$ is employed, the null hypothesis
can be rejected if the sign of t is negative, and the absolute value of t is equal to or greater than
the tabled critical one-tailed value at the prespecified level of significance.
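
Guidelines a) through c) can be expressed as a small decision function. This is an illustrative sketch only; the function name `reject_null` and the labels used for the three alternative hypotheses are assumptions introduced here, and the critical values are those quoted in Table 11.2.

```python
def reject_null(t, alternative, critical_two_tailed, critical_one_tailed):
    """Apply guidelines a) to c).

    alternative: 'two-sided' (mu1 != mu2), 'greater' (mu1 > mu2) or 'less' (mu1 < mu2).
    """
    if alternative == "two-sided":
        return abs(t) >= critical_two_tailed
    if alternative == "greater":
        return t > 0 and t >= critical_one_tailed
    if alternative == "less":
        return t < 0 and abs(t) >= critical_one_tailed
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

# Example 11.1: t = -1.96, df = 8, critical values at the .05 level
t_obtained = -1.96
for alternative in ("two-sided", "greater", "less"):
    print(alternative, reject_null(t_obtained, alternative, 2.31, 1.86))
```

With the .05 critical values, only the 'less' alternative leads to rejection of the null hypothesis.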

Employing the above guidelines, the null hypothesis can only be rejected (and only at the
.05 level) if the directional alternative hypothesis $H_1\colon \mu_1 < \mu_2$ is employed. This is the case,
since the obtained value t = -1.96 is a negative number, and the absolute value t = 1.96 is greater
than the tabled critical one-tailed .05 value $t_{.05} = 1.86$. This outcome is consistent with the
prediction that the group which receives the antidepressant will exhibit a lower level of
depression than the placebo group. Note that the alternative hypothesis $H_1\colon \mu_1 < \mu_2$ is not
supported at the .01 level, since the obtained absolute value t = 1.96 is less than the tabled critical
one-tailed .01 value $t_{.01} = 2.90$.

The nondirectional alternative hypothesis $H_1\colon \mu_1 \neq \mu_2$ is not supported, since the obtained
absolute value t = 1.96 is less than the tabled critical two-tailed .05 value $t_{.05} = 2.31$.

The directional alternative hypothesis $H_1\colon \mu_1 > \mu_2$ is not supported, since the obtained
value t = -1.96 is a negative number. In order for the alternative hypothesis $H_1\colon \mu_1 > \mu_2$ to be
supported, the computed value of t must be a positive number (and the absolute value of t must
also be equal to or greater than the tabled critical one-tailed value at the
prespecified level of significance). It should be noted that it is not likely the researcher would
employ the latter alternative hypothesis, since it predicts that the placebo group will exhibit a
lower level of depression than the group that receives the antidepressant.

A summary of the analysis of Example 11.1 with the t test for two independent samples
follows: It can be concluded that the average depression rating for the group that receives the
antidepressant medication is significantly less than the average depression rating for the placebo
group. This conclusion can only be reached if the directional alternative hypothesis
$H_1\colon \mu_1 < \mu_2$ is employed, and the prespecified level of significance is $\alpha = .05$. This result can
be summarized as follows: t(8) = 1.96, p < .05.




VI. Additional Analytical Procedures for the t Test for Two 
Independent Samples and/or Related Tests 


1. The equation for the t test for two independent samples when a value for a difference
other than zero is stated in the null hypothesis

In some sources Equation 11.5 is presented as the equation for the t test for two independent
samples.


$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_{\bar{X}_1 - \bar{X}_2}}$ (Equation 11.5)





It is only necessary to employ Equation 11.5 if, in stating the null hypothesis, a researcher
stipulates that the difference between $\mu_1$ and $\mu_2$ is some value other than zero. When the null
hypothesis is $H_0\colon \mu_1 = \mu_2$ (which as noted previously can also be written as $H_0\colon \mu_1 - \mu_2 = 0$),
the value of $(\mu_1 - \mu_2)$ reduces to zero, and thus what remains of the numerator in Equation 11.5
is $(\bar{X}_1 - \bar{X}_2)$, which constitutes the numerator of Equations 11.1-11.3. Example 11.2 will be
employed to illustrate the use of Equation 11.5 in a hypothesis-testing situation in which some
value other than zero is stipulated in the null hypothesis.


Example 11.2 The Accusharp Battery Company claims that the hearing aid battery it 
manufactures has an average life span that is two hours longer than the average life span of a 
battery manufactured by the Keenair Battery Company. In order to evaluate the claim, an 
independent researcher measures the life span of five randomly selected batteries from the stock 
of each of the two companies, and obtains the following values: Accusharp: 10, 8, 10, 9, 11; 
Keenair: 8, 9, 8, 7, 9. Do the data support the claim of the Accusharp Company?


Since the Accusharp Company (which will be designated as Group 1) specifically predicts
that the life span of its battery is 2 hours longer, the null hypothesis can be stated as follows:
$H_0\colon \mu_1 - \mu_2 = 2$. The alternative hypothesis if stated nondirectionally is $H_1\colon \mu_1 - \mu_2 \neq 2$.
If the computed absolute value of t is equal to or greater than the tabled critical two-tailed t value
at the prespecified level of significance, the nondirectional alternative hypothesis is supported.
If stated directionally, the appropriate alternative hypothesis to employ is $H_1\colon \mu_1 - \mu_2 < 2$.
The latter directional alternative hypothesis (which predicts a negative t value) is employed, since
in order for the data to contradict the claim of the Accusharp Company (and thus reject the null
hypothesis), the life span of the latter's battery can be any value that is less than 2 hours longer
than that of the Keenair battery. The alternative hypothesis $H_1\colon \mu_1 - \mu_2 > 2$ (which is only
supported with a positive t value) predicts that the superiority of the Accusharp battery is greater
than 2 hours. If the latter alternative hypothesis is employed, the null hypothesis can only be
rejected if the life span of the Accusharp battery is greater than 2 hours longer than that of the
Keenair battery.

The analysis for Example 11.2 is summarized below. 


$\Sigma X_1 = 48$    $\bar{X}_1 = 48/5 = 9.6$    $\Sigma X_1^2 = 466$
$\Sigma X_2 = 41$    $\bar{X}_2 = 41/5 = 8.2$    $\Sigma X_2^2 = 339$

$t = [(9.6 - 8.2) - 2]/.63 = -.95$

Since df = 5 + 5 - 2 = 8, the tabled critical values in Table 11.2 can be employed to
evaluate the results of the analysis. Since the obtained absolute value t = .95 is less than the
tabled critical two-tailed value $t_{.05} = 2.31$, the null hypothesis is retained. Thus, the
nondirectional alternative hypothesis $H_1\colon \mu_1 - \mu_2 \neq 2$ is not supported. It is also true that the
directional alternative hypothesis $H_1\colon \mu_1 - \mu_2 < 2$ is not supported. This is the case since
although, as predicted, the sign of the computed t value is negative, the absolute value t = .95
is less than the tabled critical one-tailed value $t_{.05} = 1.86$. The directional alternative
hypothesis $H_1\colon \mu_1 - \mu_2 > 2$ is not supported, since in order for the latter directional
alternative hypothesis to be supported, the sign of t must be positive, and the absolute value of
t must be equal to or greater than the tabled critical one-tailed value at the prespecified level of
significance.
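
The computations for Example 11.2 can be reproduced with a short sketch. The function name `t_two_independent` and its parameter `hypothesized_difference` are hypothetical names introduced for illustration. The standard error is computed as $\sqrt{s_1^2/n_1 + s_2^2/n_2}$, which is assumed here to be appropriate because the two sample sizes are equal.

```python
import math
import statistics

def t_two_independent(group1, group2, hypothesized_difference=0.0):
    """t statistic in the form of Equation 11.5, for equal sample sizes.

    Returns (t, standard error of the difference, degrees of freedom).
    """
    n1, n2 = len(group1), len(group2)
    var1 = statistics.variance(group1)  # unbiased estimate (n - 1 denominator)
    var2 = statistics.variance(group2)
    se = math.sqrt(var1 / n1 + var2 / n2)
    diff = statistics.mean(group1) - statistics.mean(group2)
    t = (diff - hypothesized_difference) / se
    return t, se, n1 + n2 - 2

accusharp = [10, 8, 10, 9, 11]
keenair = [8, 9, 8, 7, 9]

# Null hypothesis: mu1 - mu2 = 2
t, se, df = t_two_independent(accusharp, keenair, hypothesized_difference=2)
print(round(t, 2), round(se, 2), df)
```

Leaving `hypothesized_difference` at its default of zero gives the usual form of the test, in which the numerator is simply the difference between the two sample means.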

Thus, irrespective of whether a nondirectional or directional alternative hypothesis is
employed, the data are consistent with the claim of the Accusharp company that it manufactures
a battery that has a life span which is at least two hours longer than that of the Keenair battery.
In other words, the obtained difference $(\bar{X}_1 - \bar{X}_2) = 1.4$ in the numerator of Equation 11.5 is
not small enough to support the directional alternative hypothesis $H_1\colon \mu_1 - \mu_2 < 2$.

If $H_0\colon \mu_1 = \mu_2$ and $H_1\colon \mu_1 \neq \mu_2$ are employed as the null hypothesis and nondirectional
alternative hypothesis for Example 11.2, analysis of the data yields the following result:
t = (9.6 - 8.2)/.63 = 2.22. Since the obtained value t = 2.22 is less than the tabled critical
two-tailed value $t_{.05} = 2.31$, the nondirectional alternative hypothesis $H_1\colon \mu_1 \neq \mu_2$ is not supported
at the .05 level. Thus, one cannot conclude that the life span of the Accusharp battery is
significantly different than the life span of the Keenair battery. The directional alternative
hypothesis $H_1\colon \mu_1 > \mu_2$ (which can also be written as $H_1\colon \mu_1 - \mu_2 > 0$) is supported at the
.05 level, since t = 2.22 is a positive number that is larger than the tabled critical one-tailed value
$t_{.05} = 1.86$. Thus, if the null hypothesis $H_0\colon \mu_1 = \mu_2$ and the directional alternative hypothesis
$H_1\colon \mu_1 > \mu_2$ are employed, the researcher is able to conclude that the life span of the Accusharp
battery is significantly longer than the life span of the Keenair battery.

The evaluation of Example 11.2 in this section reflects the fact that the conclusions one
reaches can be affected by how a researcher states the null and alternative hypotheses. In the
case of Example 11.2, the fact that the nondirectional alternative hypothesis $H_1\colon \mu_1 - \mu_2 \neq 2$
is not supported suggests that there is a two-hour difference in favor of Accusharp (since the null
hypothesis $H_0\colon \mu_1 - \mu_2 = 2$ is retained). Yet, if the nondirectional alternative hypothesis
$H_1\colon \mu_1 \neq \mu_2$ is employed, the fact that it is not supported suggests there is no difference between
the two brands of batteries.


2. Test 11a: Hartley's $F_{\max}$ test for homogeneity of variance/F test for two population
variances: Evaluation of the homogeneity of variance assumption of the t test for two
independent samples

It is noted in Section I that one assumption of the t test for two independent samples is
homogeneity of variance. Specifically, the homogeneity of variance assumption
evaluates whether there is evidence to indicate that an inequality exists between the variances
of the populations represented by the two experimental samples. When the latter condition exists,
it is referred to as heterogeneity of variance. The null and alternative hypotheses employed in
evaluating the homogeneity of variance assumption are as follows.


Null hypothesis    $H_0\colon \sigma_1^2 = \sigma_2^2$


(The variance of the population Group 1 represents equals the variance of the population Group 
2 represents.) 


Alternative hypothesis    $H_1\colon \sigma_1^2 \neq \sigma_2^2$


(The variance of the population Group 1 represents does not equal the variance of the population 
Group 2 represents. This is a nondirectional alternative hypothesis and it is evaluated with a 
two-tailed test. In evaluating the homogeneity of variance assumption for the t test for two 
independent samples, a nondirectional alternative hypothesis is always employed.) 


One of a number of procedures that can be used to evaluate the homogeneity of variance
hypothesis is Hartley's $F_{\max}$ test (Hartley (1940, 1950)), which can be employed with a design
involving two or more independent samples. Although the $F_{\max}$ test assumes an equal number
of subjects per group, Kirk (1982, 1995) and Winer et al. (1991) among others note that if
$n_1 \neq n_2$, but are approximately the same size, one can let the value of the larger sample size
represent n when interpreting the $F_{\max}$ test statistic. The latter sources, however, note that using
the larger n will result in a slight increase in the Type I error rate for the $F_{\max}$ test.


The test statistic for Hartley's $F_{\max}$ test is computed with Equation 11.6.

$F_{\max} = \frac{\tilde{s}_L^2}{\tilde{s}_S^2}$ (Equation 11.6)

Where: $\tilde{s}_L^2$ = The larger of the two estimated population variances
       $\tilde{s}_S^2$ = The smaller of the two estimated population variances


Employing Equation 11.6 with the estimated population variances computed for Example
11.1, the value $F_{\max} = 2.03$ is computed. The reader should take note of the fact that the
computed value for $F_{\max}$ will always be a positive number that is greater than 1 (unless
$\tilde{s}_L^2 = \tilde{s}_S^2$, in which case $F_{\max} = 1$).

$F_{\max} = \frac{21.7}{10.7} = 2.03$

The computed value $F_{\max} = 2.03$ is evaluated with Table A9 (Table of the $F_{\max}$
Distribution) in the Appendix. The tabled critical values for the $F_{\max}$ distribution are listed in
reference to the values (n - 1) and k, where n represents the number of subjects per group, and
k represents the number of groups. In the case of Example 11.1, the value of $n = n_1 = n_2 = 5$.
Thus, n - 1 = 5 - 1 = 4. Since there are two groups, k = 2.

In order to reject the null hypothesis and conclude that the homogeneity of variance
assumption has been violated, the obtained $F_{\max}$ value must be equal to or greater than the tabled
critical value at the prespecified level of significance. All values listed in Table A9 are two-tailed
values. Inspection of Table A9 indicates that for n - 1 = 4 and k = 2, $F_{\max_{.05}} = 9.6$ and
$F_{\max_{.01}} = 23.2$. Since the obtained value $F_{\max} = 2.03$ is less than $F_{\max_{.05}} = 9.6$, the homogeneity
of variance assumption is not violated; in other words, the data do not suggest that the variances
of the populations represented by the two groups are unequal. Thus, the null hypothesis is
retained.
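
The computation and decision just described can be sketched as follows. The function name `hartley_f_max` is a hypothetical helper introduced here; the critical values are simply those quoted above from Table A9 and are not computed by the code.

```python
def hartley_f_max(variances):
    """Equation 11.6: ratio of the largest to the smallest estimated population variance."""
    return max(variances) / min(variances)

# Estimated population variances for the two groups in Example 11.1
f_max = hartley_f_max([21.7, 10.7])

# Tabled critical values for n - 1 = 4 and k = 2, as quoted from Table A9
critical_05, critical_01 = 9.6, 23.2

print(round(f_max, 2))
print("reject at .05:", f_max >= critical_05)
print("reject at .01:", f_max >= critical_01)
```

Because the function takes the maximum and minimum of a list, the same sketch extends to designs with more than two groups, provided the critical value is read from the row and column of Table A9 for the appropriate n - 1 and k.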

There are a number of additional points that should be made with regard to the above 
analysis: 

a) Some sources employ Equation 11.7 or Equation 11.8 in lieu of Equation 11.6 to evaluate
the homogeneity of variance assumption. When Equation 11.8 is employed to contrast two
variances, it is often referred to as an F test for two population variances.


$F = \frac{\tilde{s}_L^2}{\tilde{s}_S^2}$ (Equation 11.7)

$F = \frac{\tilde{s}_1^2}{\tilde{s}_2^2}$ (Equation 11.8)


Both Equations 11.7 and 11.8 compute an F ratio, which is based on the F distribution.
Critical values for the latter distribution are presented in Table A10 (Table of the F
Distribution) in the Appendix. The F distribution (which is discussed in greater detail under the
single-factor between-subjects analysis of variance (Test 21)) is, in fact, the sampling
distribution upon which the $F_{\max}$ distribution is based. In Table A10, critical values are listed
in reference to the number of degrees of freedom associated with the numerator and the
denominator of the F ratio. In employing the F distribution in reference to Equation 11.7, the
degrees of freedom for the numerator of the F ratio is $df_{num} = n_L - 1$ (where $n_L$ represents the
number of subjects in the group with the larger estimated population variance), and the degrees
of freedom for the denominator is $df_{den} = n_S - 1$ (where $n_S$ represents the number of subjects
in the group with the smaller estimated population variance). The tabled $F_{.975}$ value is employed
to evaluate a two-tailed alternative hypothesis at the .05 level, and the tabled $F_{.995}$ value is
employed to evaluate it at the .01 level. The reason for employing the tabled $F_{.975}$ and $F_{.995}$
values instead of $F_{.95}$ (which in this analysis represents the two-tailed .10 value and the
one-tailed .05 value) and $F_{.99}$ (which in this analysis represents the two-tailed .02 value and the
one-tailed .01 value), is that both tails of the distribution are used in employing the F distribution to
evaluate a hypothesis about two population variances. Thus, if one is conducting a two-tailed
analysis with $\alpha = .05$, .025 (i.e., .05/2 = .025) represents the proportion of cases in the extreme
left of the left tail of the F distribution, as well as the proportion of cases in the extreme right of
the right tail of the distribution. With respect to a two-tailed analysis with $\alpha = .01$, .005 (i.e.,
.01/2 = .005) represents the proportion of cases in the extreme left of the left tail of the
distribution, as well as the proportion of cases in the extreme right of the right tail of the
distribution.

In point of fact, if $df_{num} = 4$ and $df_{den} = 4$ (which are the values employed
in Example 11.1), the tabled critical two-tailed F values employed for $\alpha = .05$ and $\alpha = .01$ are
$F_{.975} = 9.60$ and $F_{.995} = 23.15$. These are the same critical .05 and .01 values that are employed
for the $F_{\max}$ test. Thus, Equation 11.7 employs the same critical values and yields an identical
result to that obtained with Equation 11.6. It should be noted, however, that if $n_1 \neq n_2$, Equation
11.6 and Equation 11.7 will employ different critical values, since Equation 11.7 (which uses the
value of n for each group in determining degrees of freedom) can accommodate unequal sample
sizes.
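
The two-tailed critical values quoted above can be checked numerically. The sketch below is an illustration, not the method used to construct Table A10: it integrates the F density with Simpson's rule and finds each percentile by bisection. The function names are hypothetical, and the code assumes a numerator df of at least 2 so that the density is finite at zero.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom; assumes d1 >= 2."""
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    coef = math.exp(-log_beta) * (d1 / d2) ** (d1 / 2)
    return coef * x ** (d1 / 2 - 1) * (1 + d1 * x / d2) ** (-(d1 + d2) / 2)

def f_cdf(x, d1, d2, steps=4000):
    """P(F <= x) by Simpson's rule on [0, x]."""
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3

def f_critical(p, d1, d2):
    """Value c with P(F <= c) = p, located by bracketing and then bisection."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, d1, d2) < p:
        lo, hi = hi, hi * 2
    for _ in range(40):
        mid = (lo + hi) / 2
        if f_cdf(mid, d1, d2) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Two-tailed analysis: alpha is split equally between the two tails
for alpha in (.05, .01):
    print(alpha, round(f_critical(1 - alpha / 2, 4, 4), 2))
```

For 4 and 4 degrees of freedom the computed percentiles agree with the tabled values $F_{.975} = 9.60$ and $F_{.995} = 23.15$ to two decimal places.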

When Group 1 has a larger estimated population variance than Group 2, everything that has
been said with respect to Equation 11.7 applies to Equation 11.8. However, when $\tilde{s}_2^2 > \tilde{s}_1^2$, the
value of F computed with Equation 11.8 will be less than 1. In such an instance, one can do either
of the following: a) Designate the group with the larger variance as Group 1 and the group with
the smaller variance as Group 2. Upon doing this, divide the larger variance by the smaller
variance (as is done in Equation 11.7), and thus obtain the same F value derived with Equation
11.7; or b) Use Equation 11.8, and employ the tabled critical $F_{.025}$ value to evaluate a two-tailed
alternative hypothesis at the .05 level of significance, and the tabled critical $F_{.005}$ value to
evaluate a two-tailed alternative hypothesis at the .01 level of significance. In such a case, in
order to be significant the computed F value must be equal to or less than the tabled critical $F_{.025}$
value (if $\alpha = .05$) or the tabled critical $F_{.005}$ value (if $\alpha = .01$). Both of the aforementioned
methods will yield the same conclusions with respect to retaining or rejecting the null hypothesis
at a given level of significance.

Equation 11.8 can also be used to test a directional alternative hypothesis concerning the
relationship between two population variances. If a researcher specifically predicts that the
variance of the population represented by Group 1 is larger than the variance of the population
represented by Group 2 (i.e., $H_1\colon \sigma_1^2 > \sigma_2^2$), or that the variance of the population represented
by Group 1 is smaller than the variance of the population represented by Group 2 (i.e.,
$H_1\colon \sigma_1^2 < \sigma_2^2$), a directional alternative hypothesis is evaluated. In such a case, the tabled
critical one-tailed F value for $(n_1 - 1), (n_2 - 1)$ degrees of freedom at the prespecified level
of significance is employed.

To illustrate the analysis of a one-tailed alternative hypothesis, let us assume that the
alternative hypothesis $H_1\colon \sigma_1^2 > \sigma_2^2$ is evaluated. If $\alpha = .05$, in order for the result to be
significant the computed value of F must be greater than 1. In addition, the tabled critical F
value that is employed in evaluating the above alternative hypothesis is $F_{.95}$. To be significant,
the obtained F value must be equal to or greater than the tabled critical $F_{.95}$ value. If $\alpha = .01$, the
tabled critical F value that is employed in evaluating the alternative hypothesis is $F_{.99}$. To be
significant, the obtained F value must be equal to or greater than the tabled critical $F_{.99}$ value.

Now let us assume that the alternative hypothesis being evaluated is $H_1\colon \sigma_1^2 < \sigma_2^2$. If
$\alpha = .05$, in order for the result to be significant, the computed value of F must be less than 1. The
tabled critical F value that is employed in evaluating the above alternative hypothesis is $F_{.05}$.
To be significant, the obtained F value must be equal to or less than the tabled critical $F_{.05}$ value.
If $\alpha = .01$, the tabled critical F value that is employed in evaluating the alternative hypothesis is
$F_{.01}$. To be significant, the obtained F value must be equal to or less than the tabled critical
$F_{.01}$ value.

If Equation 11.8 is employed with a one-tailed alternative hypothesis with reference to two
groups consisting of five subjects per group (as is the case in Example 11.1), the following tabled
critical values listed for $(n_1 - 1) = 4$, $(n_2 - 1) = 4$ are employed: a) If $\alpha = .05$, $F_{.95} = 6.39$
and $F_{.05} = .157$; and b) If $\alpha = .01$, $F_{.99} = 15.98$ and $F_{.01} = .063$. To illustrate this in a
situation where an F value less than 1 is computed, assume for the moment that we employ the
alternative hypothesis $H_1\colon \sigma_1^2 < \sigma_2^2$ for the two groups described in Example 11.1, and that the
values of the two group variances are reversed, i.e., $\tilde{s}_1^2 = 10.7$ and $\tilde{s}_2^2 = 21.7$. Employing
Equation 11.8 with this data, F = 10.7/21.7 = .49. The obtained value F = .49 is not significant,
since to be significant, the computed value of F must be equal to or less than $F_{.05} = .157$.
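
The lower-tail critical values in this illustration are related to the upper-tail values by a reciprocal: when the numerator and denominator degrees of freedom are equal, the lower critical value is 1 divided by the corresponding upper critical value. The sketch below relies on that general property of the F distribution (it is not stated in the text above) together with the tabled values 6.39 and 15.98; the variable names are illustrative.

```python
# Tabled upper-tail critical values for 4 and 4 degrees of freedom (from the text)
f_95, f_99 = 6.39, 15.98

# With equal numerator and denominator df, the lower-tail critical value
# is the reciprocal of the upper-tail critical value.
f_05 = 1 / f_95
f_01 = 1 / f_99

# Equation 11.8 with the reversed variances used in the illustration
f_obtained = 10.7 / 21.7

print(round(f_05, 3), round(f_01, 3), round(f_obtained, 2))
print("significant at .05:", f_obtained <= f_05)
```

The reciprocal of the rounded value 6.39 is about .156, which differs from the tabled .157 only through rounding of the upper critical value.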

It should be noted that when the general procedure discussed in this section is employed
to evaluate the homogeneity of variance assumption with reference to Example 11.1, in order to
reject the null hypothesis for $\alpha = .05$, the larger of the estimated population variances must be
more than 9 times the magnitude of the smaller variance, and for $\alpha = .01$ the larger variance
must be more than 23 times the magnitude of the smaller variance. Within the framework of the
$F_{\max}$ test, such a large discrepancy between the estimated population variances is tolerated when
the number of subjects per group is small. Inspection of Table A9 reveals that as sample size
increases, the magnitude of the tabled critical $F_{\max}$ values decreases. Thus, the larger the sample
size, the smaller the difference between the variances that will be acceptable.

Two assumptions common to Equations 11.6-11.8 (i.e., all of the equations that can be
employed to evaluate the homogeneity of variance assumption) are: a) Each sample has been
randomly selected from the population it represents; and b) The distribution of data in the
underlying population from which each of the samples is derived is normal. Violation of these
assumptions can compromise the reliability of the $F_{\max}$ test statistic, which many sources note
is extremely sensitive to violation of the normality assumption. Various sources (e.g., Keppel
(1991)) point out that when the $F_{\max}$ test is employed to evaluate the homogeneity of variance
hypothesis, it is not as powerful (i.e., likely to detect heterogeneity of variance when it is present)
as some alternative but computationally more involved procedures. The consequence of not
detecting heterogeneity of variance is that it increases the likelihood of committing a Type I error
in conducting the t test for two independent samples. Additional discussion of the
homogeneity of variance assumption and alternative procedures that can be used to evaluate it
can be found in Section VI of the single-factor between-subjects analysis of variance.

In the event the homogeneity of variance assumption is violated, a number of different
strategies (which yield similar but not identical results) are recommended with reference to
conducting the t test for two independent samples. Since heterogeneity of variance increases
the likelihood of committing a Type I error, all of the strategies that are recommended result in
a more conservative t test (i.e., making it more difficult for the test to reject the null hypothesis).
Such strategies compute either: a) An adjusted critical t value that is larger than the unadjusted
critical t value; or b) An adjusted degrees of freedom value which is smaller than the value
computed with Equation 11.4. By decreasing the degrees of freedom, a larger tabled critical t
value is employed in evaluating the computed t value.

Before describing one of the procedures that can be employed when the homogeneity of
variance assumption is violated, it should be pointed out that the existence of heterogeneity of 
variance in a set of data may in itself be noteworthy. It is conceivable that although the analysis 
of the data for an experiment may indicate that there is no difference between the group means, 
one cannot rule out the possibility that there may be a significant difference between the 
variances of the two groups. This latter finding may be of practical importance in clarifying the 
relationship between the variables under study. This general issue is addressed in Section VIII 
of the single-sample runs test (Test 10) within the framework of the discussion of Example 
10.6. In the latter discussion, an experiment is described in which subjects in an experimental 
group who receive an antidepressant either improve dramatically or regress while on the drug. 
In contrast, the scores of a placebo/control group exhibit little variability, and fall in between the 
two extreme sets of scores in the group that receives the antidepressant. Analysis of such a study 
with the t test for two independent samples will in all likelihood not yield a significant result,
since the two groups will probably have approximately the same mean depression score. The fact 
that the group receiving the drug exhibits greater variability than the group receiving the placebo 
indicates that the effect of the drug is not consistent for all people who are depressed. Such an 
effect can be identified through use of a test such as the $F_{\max}$ test, which contrasts the variability
of two groups. 

Two statisticians (Behrens and Fisher) developed a sampling distribution for the t statistic
when the homogeneity of variance assumption is violated. The latter sampling distribution is
referred to as the $t'$ distribution. Since tables of critical values developed by Behrens and Fisher
can only be employed for a limited number of sample sizes, Cochran and Cox (1957) developed
a methodology that allows one to compute critical values of $t'$ for all values of $n_1$ and $n_2$.
Equation 11.9 summarizes the computation of $t'$.


$t' = \frac{t_1 (\tilde{s}_1^2/n_1) + t_2 (\tilde{s}_2^2/n_2)}{(\tilde{s}_1^2/n_1) + (\tilde{s}_2^2/n_2)}$ (Equation 11.9)

Where: $t_1$ = The tabled critical t value at the prespecified level of significance for
       $df = n_1 - 1$
       $t_2$ = The tabled critical t value at the prespecified level of significance for
       $df = n_2 - 1$
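
As a hedged illustration of how the adjusted critical value is obtained, the sketch below assumes that Equation 11.9 takes the weighted-average form associated with Cochran and Cox, in which each tabled critical value is weighted by that group's estimated variance divided by its sample size. The function name `cochran_cox_t_prime` is hypothetical. The inputs are the Example 11.1 variances (21.7 and 10.7), n = 5 per group, and the tabled two-tailed .05 value of 2.78 for df = 4.

```python
def cochran_cox_t_prime(t1, t2, var1, var2, n1, n2):
    """Adjusted critical value t' as a weighted average of t1 and t2.

    Assumed form: weights are the estimated variances of the two sample means.
    """
    w1 = var1 / n1
    w2 = var2 / n2
    return (t1 * w1 + t2 * w2) / (w1 + w2)

t_prime = cochran_cox_t_prime(t1=2.78, t2=2.78, var1=21.7, var2=10.7, n1=5, n2=5)
t_obtained = -1.96

print(round(t_prime, 2))
print("reject null (two-tailed, .05):", abs(t_obtained) >= t_prime)
```

When the two sample sizes are equal, $t_1 = t_2$ and the weighted average collapses to that common critical value, regardless of the variances.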


Equation 11.9 will be employed with the data for Example 11.1. For purposes of
illustration, it will be assumed that the homogeneity of variance assumption has been violated,
and that the nondirectional alternative hypothesis $H_1\colon \mu_1 \neq \mu_2$ is evaluated, with $\alpha = .05$.
Since $n_1 = n_2 = n = 5$, $df_1 = df_2 = n - 1 = 4$. Employing Table A2, we determine that for $\alpha = .05$
and df = 4, the tabled critical two-tailed .05 value is $t_{.05} = 2.78$. Thus, the values $t_1 = 2.78$
and $t_2 = 2.78$ are substituted in Equation 11.9, along with the values of the estimated population
variances and the sample sizes.

$t' = \frac{(2.78)(21.7/5) + (2.78)(10.7/5)}{(21.7/5) + (10.7/5)} = 2.78$

Note that the computed value $t' = 2.78$ is larger than the tabled critical two-tailed .05
value $t_{.05} = 2.31$, which is employed if the homogeneity of variance assumption is not violated.
Since the value of $t'$ will always be larger than the tabled critical t value at the prespecified level
of significance for $df = n_1 + n_2 - 2$ (except for the instance noted in Endnote 14), use of the
$t'$ statistic will result in a more conservative test. In our hypothetical example, use of the $t'$
statistic is designed to ensure that the Type I error rate will conform to the prespecified value
$\alpha = .05$. If there is heterogeneity of variance and the homogeneity of variance adjustment is not
employed, the actual alpha level will be greater than $\alpha = .05$. Since in our example the computed
value $t' = 2.78$ is larger than the computed absolute value t = 1.96 obtained for the t test, the null
hypothesis cannot be rejected. The methodology described in this section for dealing with
heterogeneity of variance provides for a slightly more conservative t test for two independent
samples than do alternative strategies developed by Satterthwaite (1946) and Welch (1947)
(which are described in Howell (1992) and Winer et al. (1991)). Another strategy for dealing
with heterogeneity of variance is to employ, in lieu of the t test, a nonparametric test which does
not assume homogeneity of variance.
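
This section names the Satterthwaite and Welch strategies without giving their equations. As a rough sketch only, the code below assumes the commonly cited Welch-Satterthwaite approximation for the adjusted degrees of freedom; the function name `welch_satterthwaite_df` is hypothetical, and the formula should be checked against the sources cited above before it is relied upon.

```python
def welch_satterthwaite_df(var1, var2, n1, n2):
    """Approximate adjusted degrees of freedom under unequal variances (assumed form)."""
    a = var1 / n1
    b = var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# Example 11.1: estimated variances 21.7 and 10.7, five subjects per group
adjusted_df = welch_satterthwaite_df(21.7, 10.7, 5, 5)
unadjusted_df = 5 + 5 - 2  # Equation 11.4

print(round(adjusted_df, 2), unadjusted_df)
```

The adjusted value falls below the df = 8 given by Equation 11.4, so a larger tabled critical t value would be consulted, which is the second of the two kinds of adjustment described earlier in this section.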


3. Computation of the power of the t test for two independent samples and the application
of Test 11b: Cohen's d index

In this section two methods for computing power, which are
extensions of the methods presented for computing the power of the single-sample t test, will
be described. Prior to reading this section the reader should review the discussion of power in
Section VI of the latter test.

Method 1 for computing the power of the t test for two independent samples

The first procedure to be described is a graphical method which reveals the logic underlying the power
computations for the t test for two independent samples. In the discussion to follow, it will be
assumed that the null hypothesis is identical to that employed for Example 11.1 (i.e.,
$H_0\colon \mu_1 - \mu_2 = 0$, which, as previously noted, is another way of writing $H_0\colon \mu_1 = \mu_2$). It will
also be assumed that the researcher wants to evaluate the power of the t test for two
independent samples in reference to the following alternative hypothesis: $H_1\colon |\mu_1 - \mu_2| \geq 5$
(which is the difference obtained between the sample means in Example 11.1). In other words,
it is predicted that the absolute value of the difference between the two means is equal to or
greater than 5. The latter alternative hypothesis is employed in lieu of $H_1\colon \mu_1 - \mu_2 \neq 0$ (which
can also be written as $H_1\colon \mu_1 \neq \mu_2$), since in order to compute the power of the test, a specific
value must be stated for the difference between the population means. Note that, as stated, the
alternative hypothesis stipulates a nondirectional analysis, since it does not specify which of the
two means will be the larger value. It will be assumed that $\alpha = .05$ is employed in the analysis.

Figure 11.1 Visual Representation of Power for Example 11.1


Figure 11.1, which provides a visual summary of the power analysis, is comprised of two
overlapping sampling distributions of difference scores. The distribution on the left, which will
be designated as Distribution A, is a sampling distribution of difference scores that has a mean
value of zero (i.e., $\mu_D = \mu_{\bar{X}_1 - \bar{X}_2} = 0$). This latter value will be represented by $\mu_{D_0} = 0$ in
Figure 11.1. Distribution A represents the sampling distribution that describes the distribution
of difference scores if the null hypothesis is true. The distribution on the right, which will be
designated as Distribution B, is a sampling distribution of difference scores that has a mean value
of 5 (i.e., $\mu_D = \mu_{\bar{X}_1 - \bar{X}_2} = 5$). This latter value will be represented by $\mu_{D_1} = 5$ in Figure 11.1.
Distribution B represents the sampling distribution that describes the distribution of difference
scores if the alternative hypothesis is true. It will be assumed that each of the sampling
distributions has a standard deviation that is equal to the value computed for the standard error of
the difference in Example 11.1 (i.e., $s_{\bar{X}_1 - \bar{X}_2} = 2.55$), since the latter value provides the best
estimate of the standard deviation of the difference scores for the underlying populations.

In Figure 11.1, area (///) delineates the proportion of Distribution A that corresponds to
the value $\alpha/2$, which equals .025. This is the case, since $\alpha = .05$ and a two-tailed analysis is
conducted. Area (=) delineates the proportion of Distribution B that corresponds to the
probability of committing a Type II error ($\beta$). Area (\\\) delineates the proportion of Distribution
B that represents the power of the test (i.e., $1 - \beta$).

The procedure for computing the proportions documented in Figure 11.1 will now be
described. The first step in computing the power of the test requires one to determine how large
a difference there must be between the sample means in order to reject the null hypothesis. In
order to do this, we algebraically transpose the terms in Equation 11.1, using $s_{\bar{X}_1 - \bar{X}_2}$ to
summarize the denominator of the equation, and $t_{.05}$ (the tabled critical two-tailed .05 t value) to
represent t. Thus: $\bar{X}_1 - \bar{X}_2 = (t_{.05})(s_{\bar{X}_1 - \bar{X}_2})$. By substituting the values $t_{.05} = 2.31$ and
$s_{\bar{X}_1 - \bar{X}_2} = 2.55$ in the latter equation, we determine that the minimum required difference is
$\bar{X}_1 - \bar{X}_2 = (2.31)(2.55) = 5.89$. Thus, any difference between the two sample means that
is equal to or greater than 5.89 will allow the researcher to reject the null hypothesis at the .05
level.

The next step in the analysis requires one to compute the area in Distribution B that falls
between the mean difference $\mu_{D_1} = 5$ (i.e., the mean of Distribution B) and a mean difference
equal to 5.89 (represented by the notation $\bar{X}_D = 5.89$ in Figure 11.1). This is accomplished by
employing Equation 11.1. In using the latter equation, the value of $\bar{X}_1$ is represented by 5.89 and
the value of $\bar{X}_2$ by $\mu_{D_1} = 5$.

$$t = \frac{5.89 - 5}{2.55} = .35$$

By interpolating the values listed in Table A2 for df = 8, we determine that the proportion
of Distribution B that lies to the right of a t score of .35 (which corresponds to a mean
difference of 5.89) is approximately .38. The latter area corresponds to area (\\\) in Distribution
B. Note that the left boundary of area (\\\) is also the boundary delineating the extreme 2.5% of
Distribution A (i.e., $\alpha/2 = .025$, which is the rejection zone for the null hypothesis). Since area
(\\\) in Distribution B overlaps the rejection zone in Distribution A, area (\\\) represents the
power of the test (i.e., it represents the likelihood of rejecting the null hypothesis if the
alternative hypothesis is true). The likelihood of committing a Type II error ($\beta$) is represented by
area (=), which comprises the remainder of Distribution B. The proportion of Distribution B
that constitutes this latter area is determined by subtracting the value .38 from 1. Thus:
$\beta = 1 - .38 = .62$.

Based on the results of the power analysis, we can state that if the alternative hypothesis 
$H_1$: $|\mu_1 - \mu_2| = 5$ is true, the likelihood that the null hypothesis will be rejected is .38, and at
the same time there is a .62 likelihood that it will be retained. If the researcher considers the 
computed value for power too low (which in actuality should be determined prior to conducting 
a study), she can increase the power of the test by employing a larger sample size. 
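
The graphical method can also be checked numerically. The following Python sketch is only an illustration under stated assumptions: like the text, it treats Distribution B as an ordinary t distribution with df = 8 centered on a mean difference of 5, and it replaces interpolation in Table A2 with Simpson's rule applied to the t density. The helper names `t_pdf` and `t_upper_tail` are hypothetical and are not part of any procedure described in this book.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t), using Simpson's rule on the density over [0, |t|]."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # area between 0 and |t|
    return 0.5 - area if t >= 0 else 0.5 + area

t_crit = 2.31      # tabled critical two-tailed .05 t value for df = 8
se_diff = 2.55     # standard error of the difference (Example 11.1)
mu_alt = 5.0       # mean of Distribution B
df = 8

min_diff = t_crit * se_diff                 # smallest difference that rejects H0
t_boundary = (min_diff - mu_alt) / se_diff  # that boundary as a t score in Distribution B
power = t_upper_tail(t_boundary, df)        # area to the right of the boundary
beta = 1 - power                            # remainder of Distribution B

print(round(min_diff, 2), round(t_boundary, 2), round(power, 2), round(beta, 2))
```

The computed power should fall close to the interpolated value of approximately .38 reported above; a small discrepancy is to be expected, since interpolation in a printed table is coarse.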


Method 2 for computing the power of the t test for two independent samples employing Test
11b: Cohen's d index  Method 2, the quick computational method described for computing the
power of the single-sample t test, can be extended to the t test for two independent samples.
In using the latter method, the researcher must stipulate an effect size (d), which in the case of the
t test for two independent samples is computed with Equation 11.10. The effect size index
computed with Equation 11.10 was developed by Cohen (1977, 1988), and is known as Cohen's
d index. Further discussion of Cohen's d index can be found in the next section dealing with
magnitude of treatment effect, as well as in Section IX (the Appendix) of the Pearson
product-moment correlation coefficient (Test 28) under the discussion of meta-analysis and related
topics.

$$d = \frac{|\mu_1 - \mu_2|}{\sigma} \qquad \text{(Equation 11.10)}$$


The numerator of Equation 11.10 represents the hypothesized difference between the two
population means. As is the case with the graphical method described previously, when a power
analysis is conducted after the mean of each sample has been obtained, the difference between
the two sample means (i.e., $\bar{X}_1 - \bar{X}_2$) is employed as an estimate of the value of $|\mu_1 - \mu_2|$. It
is assumed that the value of the standard deviation for the variable being measured is the same
in each of the populations, and the latter value is employed to represent $\sigma$ in the denominator of
Equation 11.10 (i.e., $\sigma = \sigma_1 = \sigma_2$). In instances where the standard deviations of the two
populations are not known or cannot be estimated, the latter value can be estimated from the
sample data. Because $\tilde{s}_1$ will usually not equal $\tilde{s}_2$, a pooled estimated population
standard deviation ($\tilde{s}_p$) can be computed with Equation 11.11 (which is the square root of $\tilde{s}_p^2$
discussed in Section IV with reference to Equation 11.3).

$$\tilde{s}_p = \sqrt{\frac{(n_1 - 1)\tilde{s}_1^2 + (n_2 - 1)\tilde{s}_2^2}{n_1 + n_2 - 2}} \qquad \text{(Equation 11.11)}$$

Since the effect size computed with Equation 11.10 is based on population parameters, it
is necessary to convert the value of d into a measure that takes into account the size of the
samples (which is a relevant variable in determining the power of the test). This measure, as
noted in the discussion of the single-sample t test, is referred to as the noncentrality parameter.
Equation 11.12 is employed to compute the noncentrality parameter ($\delta$) for the t test for two
independent samples. When the sample sizes are equal, the value of n in Equation 11.12 will
be $n = n_1 = n_2$. When the sample sizes are unequal, the value of n will be represented by the
harmonic mean of the sample sizes, which is described later in this section.

$$\delta = d\sqrt{\frac{n}{2}} \qquad \text{(Equation 11.12)}$$


The power of the t test for two independent samples will now be computed using the data
for Example 11.1. For purposes of illustration, it will be assumed that the minimum difference
between the population means the researcher is trying to detect is the observed 5 point difference
between the two sample means, i.e., $|\bar{X}_1 - \bar{X}_2| = |2.8 - 7.8| = 5 = |\mu_1 - \mu_2|$. The value
of $\sigma$ employed in Equation 11.10 is estimated by computing a pooled value for the standard
deviation using Equation 11.11. Substituting the relevant values from Example 11.1 in Equation
11.11, the value $\tilde{s}_p = 4.02$ is computed.

$$\tilde{s}_p = \sqrt{\frac{(5 - 1)(21.7) + (5 - 1)(10.7)}{5 + 5 - 2}} = 4.02$$

Substituting $|\mu_1 - \mu_2| = 5$ and $\sigma = 4.02$ in Equation 11.10, the value d = 1.24 is
computed.

$$d = \frac{5}{4.02} = 1.24$$

Cohen (1977; 1988, pp. 24-27) has proposed the following (admittedly arbitrary) d values
as criteria for identifying the magnitude of an effect size: a) A small effect size is one that is
greater than .2 but not more than .5 standard deviation units; b) A medium effect size is one that
is greater than .5 but not more than .8 standard deviation units; and c) A large effect size is
greater than .8 standard deviation units. Employing Cohen's (1977, 1988) guidelines, the value
d = 1.24 (which represents 1.24 standard deviation units) is categorized as a large effect size.

Along with the value n = 5 (since $n_1 = n_2 = n = 5$), the value d = 1.24 is substituted in
Equation 11.12, resulting in the value $\delta = 1.96$.

$$\delta = 1.24\sqrt{\frac{5}{2}} = 1.96$$


The value $\delta = 1.96$ is evaluated with Table A3 (Power Curves for Student's t
Distribution) in the Appendix. We will assume that for the example under discussion a
two-tailed test is conducted with $\alpha = .05$, and thus Table A3-C is the appropriate set of power curves
to employ for the analysis. Since there is no curve for df = 8, the power of the test will be based
on a curve that falls in between the df = 6 and df = 12 power curves. Through interpolation, the
power of the t test for two independent samples is determined to be approximately .38 (which
is the same value that is obtained with the graphical method). Thus, by employing 5 subjects in
each group the researcher has a probability of .38 of rejecting the null hypothesis if the true
difference between the population means is equal to or greater than 1.24 standard deviation units
(which in Example 11.1 is equivalent to a 5 point difference between the means). It should be
noted that in employing Equation 11.12, the smaller the value of n the smaller the computed
value of $\delta$, and consequently the lower the power of the test.
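
The arithmetic of Equations 11.10 through 11.12 can be traced with a short Python sketch. It is offered as an illustration only: the function names are hypothetical, the input values are the summary statistics quoted for Example 11.1, and the intermediate results are rounded at the same points as in the text. Reading the power itself from Table A3 remains a lookup step that the sketch does not attempt.

```python
import math

def pooled_sd(var1, n1, var2, n2):
    """Equation 11.11: pooled estimate of the population standard deviation."""
    return math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

def cohens_d(mean_diff, sigma):
    """Equation 11.10: effect size in standard deviation units."""
    return abs(mean_diff) / sigma

def noncentrality(d, n):
    """Equation 11.12: delta = d * sqrt(n / 2), with n subjects per group."""
    return d * math.sqrt(n / 2)

# Summary values quoted in the text for Example 11.1
var1, var2 = 21.7, 10.7   # estimated population variances
n1 = n2 = 5
mean_diff = 2.8 - 7.8     # difference between the sample means

s_p = pooled_sd(var1, n1, var2, n2)
d = cohens_d(mean_diff, round(s_p, 2))     # the text rounds the pooled value to 4.02 first
delta = noncentrality(round(d, 2), n1)     # and d to 1.24 before computing delta

print(round(s_p, 2), round(d, 2), round(delta, 2))
```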

It was noted earlier in the discussion that when the sample sizes are unequal, the value of
n in Equation 11.12 is represented by the harmonic mean of the sample sizes (which will be
represented by the notation $\bar{n}_h$). The harmonic mean is computed with Equation 11.13.

$$\bar{n}_h = \frac{k}{\sum_{j=1}^{k} \frac{1}{n_j}} \qquad \text{(Equation 11.13)}$$

Where: k = The number of groups
       $n_j$ = The number of subjects in the jth group

The use of Equation 11.13 will be illustrated for a case in which there is a total of 10
subjects, but there is an unequal number of subjects in each group. Thus, let us assume that
$n_1 = 7$ and $n_2 = 3$. The harmonic mean can be computed as follows: $\bar{n}_h = 2/[(1/7) + (1/3)]
= 4.20$. The reader should take note of the fact that the value $\bar{n}_h = 4.20$ computed for the
harmonic mean is lower than the average number of subjects per group ($\bar{n}$), which is computed
to be $\bar{n} = (7 + 3)/2 = 5$. In point of fact, $\bar{n}_h$ will always be lower than $\bar{n}$ unless $n_1 = n_2$, in
which case $\bar{n}_h = \bar{n}$. Since, when $n_1 \neq n_2$, $\bar{n}_h$ will always be less than $\bar{n}$, it follows that when
$n_1 \neq n_2$, the value computed for the power of the test when the harmonic mean is employed in
Equation 11.12 will be less than the value that is computed for the power of the test if $\bar{n}$ is
employed to represent the value of n. This translates into the fact that for a specific total sample
size (i.e., $n_1 + n_2$), the power of the t test for two independent samples will be maximized
when $n_1 = n_2$.
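
The effect of unequal group sizes on the noncentrality parameter can be sketched in Python as follows. This is an illustrative sketch rather than a prescribed procedure: `harmonic_mean` is a hypothetical helper written directly from Equation 11.13, and the effect size d = 1.24 is borrowed from Example 11.1 purely to make the comparison concrete.

```python
import math

def harmonic_mean(sizes):
    """Equation 11.13: k divided by the sum of the reciprocals of the group sizes."""
    return len(sizes) / sum(1 / n for n in sizes)

sizes = [7, 3]
n_h = harmonic_mean(sizes)               # 2 / (1/7 + 1/3)
n_bar = sum(sizes) / len(sizes)          # ordinary average group size

d = 1.24                                 # effect size carried over from Example 11.1
delta_h = d * math.sqrt(n_h / 2)         # Equation 11.12 with the harmonic mean
delta_bar = d * math.sqrt(n_bar / 2)     # Equation 11.12 with the arithmetic mean

print(round(n_h, 2), n_bar, round(delta_h, 2), round(delta_bar, 2))
```

Because the harmonic mean is smaller than the arithmetic mean whenever the group sizes differ, the value of delta, and with it the power read from the power curves, is smaller for the 7/3 split than for a 5/5 split of the same ten subjects.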

As is the case with power computations for the single-sample t test, as long as a researcher
knows or is able to estimate (from the sample data) the population standard deviation, by
employing trial and error she can substitute various values of n in Equation 11.12, until the
computed value of $\delta$ corresponds to the desired power value for the t test for two independent
samples for a given effect size. This process can be facilitated by employing tables developed
by Cohen (1977, 1988), which allow one to determine the minimum sample size necessary in
order to achieve a specific level of power in reference to a given effect size.
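
The trial-and-error search can be sketched in Python. An important assumption is made here that is not part of the text: the power associated with each value of delta is approximated with the standard normal curve (via `statistics.NormalDist`) rather than read from the power curves in Table A3 or from Cohen's tables. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(d, n, alpha=0.05):
    """Rough two-tailed power: P(Z > z_crit - delta), delta from Equation 11.12.

    Assumption: the normal curve stands in for the t-based power curves,
    and the negligible area in the opposite tail is ignored.
    """
    delta = d * math.sqrt(n / 2)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_crit - delta)

def smallest_n(d, target_power, alpha=0.05, n_max=10_000):
    """Try n = 2, 3, ... per group until the approximate power reaches the target."""
    for n in range(2, n_max + 1):
        if approx_power(d, n, alpha) >= target_power:
            return n
    return None

n_needed = smallest_n(d=1.24, target_power=0.80)
print(n_needed, round(approx_power(1.24, n_needed), 3))
```

The normal approximation is optimistic when samples are small. For n = 5 per group it gives a power of about .50, compared with the value of approximately .38 obtained from the power curves for df = 8, so the sample size returned by the sketch is best read as a lower bound to be confirmed against Cohen's tables.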


4. Measure of magnitude of treatment effect for the t test for two independent samples:
Omega squared (Test 11c)  At the conclusion of an experiment a researcher may want to
determine the proportion of the variability on the dependent variable that is associated with the
experimental treatments (i.e., the independent variable). This latter value is commonly referred
to as the treatment effect. Unfortunately, the t value computed for the t test for two
independent samples does not in itself provide information regarding the magnitude of a treatment
effect. The reason for this is that the absolute value of t is not only a function of the treatment 
effect, but is also a function of the size of the sample employed in an experiment. Since the 
power of a statistical test is directly related to sample size, the larger the sample size, the more 
likely a significant t value will be obtained if there is any difference between the means of the 
underlying populations. Regardless of how small a treatment effect is present, the magnitude of 
the absolute value of t will increase as the size of the sample employed to detect that effect 
increases. Thus, a t value that is significant at any level (be it .05, .01, .001, etc.) can result from 
the presence of a large, medium, or small treatment effect. 

Before describing measures of treatment effect for the t test for two independent samples, 
the distinction between statistical significance and practical significance (which is discussed 
briefly in the Introduction) will be clarified. Whereas statistical significance only indicates that 
a difference between the two group means has been detected and that the difference is unlikely 
to be the result of chance, practical significance refers to the practical implications of the 
obtained difference. As just noted, by employing a large sample size, a researcher will be able 
to detect differences between means that are extremely small. Although in some instances a 
small treatment effect can be of practical significance, more often than not a minimal difference 
will be of little or no practical value (other than perhaps allowing a researcher to get a study 
published, since significant results are more likely to be published than nonsignificant results). 
On the other hand, the larger the magnitude of treatment effect computed for an experiment, the 
more likely the results have practical implications. To go even further, when a researcher 
employs a small sample size, it is possible to have a moderate or large treatment effect present, 
yet not obtain a statistically significant result. Obviously, in such an instance (which represents 
an example of a Type II error) the computed t value is misleading with respect to the truth 
regarding the relationship between the variables under study. 

A number of indices for measuring magnitude of treatment effect have been developed. 
Unlike the computed t value, measures of magnitude of treatment effect provide an index of the 
degree of relationship between the independent and dependent variables that is independent of 
sample size. A major problem with measures of treatment effect is that for most experimental 
designs two or more such measures are available which are not equivalent to one another, and 
researchers are often not in agreement with respect to which measure is appropriate to employ. 
In the case of the t test for two independent samples, the most commonly employed measure
of treatment effect is omega squared. The statistic that is computed from the sample data to
estimate the value of omega squared is represented by the notation $\tilde{\omega}^2$ ($\omega$ is the lowercase
Greek letter omega). This latter value provides an estimate of the underlying population
parameter $\omega^2$, which represents the proportion of variability on the dependent variable that is
associated with the independent variable in the underlying population. The value of $\tilde{\omega}^2$ is
computed with Equation 11.14.

$$\tilde{\omega}^2 = \frac{t^2 - 1}{t^2 + n_1 + n_2 - 1} \qquad \text{(Equation 11.14)}$$


Although the value of $\tilde{\omega}^2$ will generally fall in the range between 0 and 1, when $|t| < 1$,
$\tilde{\omega}^2$ will be a negative number. The closer $\tilde{\omega}^2$ is to 1, the stronger the association between the
independent and dependent variables, whereas the closer $\tilde{\omega}^2$ is to 0, the weaker the association
between the two variables. An $\tilde{\omega}^2$ value equal to or less than 0 indicates that there is no
association between the variables.

Employing Equation 11.14 with the data for Example 11.1, the value $\tilde{\omega}^2 = .22$ is
computed.

$$\tilde{\omega}^2 = \frac{(-1.96)^2 - 1}{(-1.96)^2 + 5 + 5 - 1} = .22$$

The value $\tilde{\omega}^2 = .22$ indicates that 22% (or a proportion equal to .22) of the variability on
the dependent variable (the depression ratings of the subjects) is associated with variability on
the levels of the independent variable (the drug versus placebo conditions). To say it another
way, 22% of the variability on the depression scores can be accounted for on the basis of the
group to which a subject belongs.

Cohen (1977; 1988, pp. 285-288) has suggested the following (admittedly arbitrary) values,
which are employed in psychology and a number of other disciplines, as guidelines for
interpreting $\tilde{\omega}^2$: a) A small effect size is one that is greater than .0099 but not more than .0588;
b) A medium effect size is one that is greater than .0588 but not more than .1379; and c) A large
effect size is greater than .1379. In the case of Example 11.1, if one employs Cohen's (1977,
1988) guidelines, the obtained value $\tilde{\omega}^2 = .22$ indicates the presence of a large treatment effect.
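
Equation 11.14 and the guidelines just quoted can be expressed as a brief Python sketch. The function names are hypothetical, and the cutoffs are simply those attributed to Cohen (1977, 1988) in the preceding paragraph; the sketch is meant as a convenience for checking the arithmetic, not as a substitute for judgment about what constitutes a meaningful effect.

```python
def omega_squared(t, n1, n2):
    """Equation 11.14: estimated proportion of variability tied to the treatment."""
    return (t ** 2 - 1) / (t ** 2 + n1 + n2 - 1)

def cohen_label(w2):
    """Verbal label based on the cutoffs quoted in the text from Cohen (1977, 1988)."""
    if w2 > 0.1379:
        return "large"
    if w2 > 0.0588:
        return "medium"
    if w2 > 0.0099:
        return "small"
    return "below the small-effect cutoff"

w2 = omega_squared(-1.96, 5, 5)          # t and sample sizes from Example 11.1
print(round(w2, 2), cohen_label(w2))

# When |t| < 1 the estimate is negative, which the text reads as no association
print(omega_squared(0.5, 5, 5) < 0)
```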

Keppel (1991) and Keppel et al. (1992) note that in the experimental literature in the
discipline of psychology it is unusual for an $\tilde{\omega}^2$ value to exceed .25; indeed, one review of the
psychological literature yielded an average $\tilde{\omega}^2$ value of .06. The inability of researchers to
control experimental error with great precision is the most commonly cited reason for the low
value obtained for $\tilde{\omega}^2$ in most studies.

Eta squared ($\tilde{\eta}^2$) is an alternative but less commonly used measure of association that can
also be employed to evaluate the magnitude of a treatment effect for a t test for two independent
samples. The eta squared (Test 21h) statistic is described in Section VI of the single-factor
between-subjects analysis of variance. Most sources note that eta squared results in a more
biased estimate of the degree of association in the underlying population than does omega
squared. Eta squared is also discussed under the Pearson product-moment correlation
coefficient with reference to the point-biserial correlation coefficient (Test 28h). In the latter
discussion, it is demonstrated that eta squared and the point-biserial correlation coefficient
are equivalent measures when they are used to evaluate magnitude of treatment effect for a
design involving two independent samples. In the discussion of the single-factor
between-subjects analysis of variance, it is noted that if $\tilde{\eta}^2$ and $\tilde{\omega}^2$ are computed for the same set of
data, they yield different values. The issue of the lack of agreement among different magnitude
of treatment effect measures is considered in more detail in Section VI of the single-factor
between-subjects analysis of variance.

In closing the discussion of magnitude of treatment effect, it should be noted that at the
present time many sources recommend that in reporting the results of an experimental analysis
with an inferential statistical test (such as the t test for two independent samples), in addition
to reporting the computed test statistic (e.g., a t value), the researcher should also present a
measure of the magnitude of treatment effect (e.g., $\tilde{\omega}^2$), since, by including the latter value, one
is providing additional information that can further clarify the nature of the relationship between
the variables under study. A general discussion of the latter issue, as well as additional
discussion of measures of treatment effect, can be found in Section IX (the Addendum) of the
Pearson product-moment correlation coefficient under the discussion of meta-analysis and
related topics.


5. Computation of a confidence interval for the t test for two independent samples  Prior to
reading this section the reader should review the discussion on the computation of confidence
intervals in Section VI of the single-sample t test. When interval/ratio data are available for two
independent samples, a confidence interval can be computed that identifies a range of values
within which one can be confident to a specified degree that the true difference between the
two population means lies. Equation 11.15 is the general equation for computing the confidence
interval for the difference between the means of two independent populations.

$$CI_{(1 - \alpha)} = (\bar{X}_1 - \bar{X}_2) \pm (t_{\alpha/2})(s_{\bar{X}_1 - \bar{X}_2}) \qquad \text{(Equation 11.15)}$$

Where: $t_{\alpha/2}$ represents the tabled critical two-tailed value in the t distribution, for
       $df = n_1 + n_2 - 2$, below which a proportion (percentage) equal to $[1 - (\alpha/2)]$ of
       the cases falls. If the proportion (percentage) of the distribution that falls within the
       confidence interval is subtracted from 1 (100%), it will equal the value of $\alpha$.

Employing Equation 11.15, the 95% confidence interval for Example 11.1 is computed below. In
employing Equation 11.15, $(\bar{X}_1 - \bar{X}_2)$ represents the obtained difference between the group
means (which is the numerator of the equation used to compute the value of t), $t_{.05}$ represents
the tabled critical two-tailed .05 value for $df = n_1 + n_2 - 2$, and $s_{\bar{X}_1 - \bar{X}_2}$ represents the standard
error of the difference (which is the denominator of the equation used to compute the value of
t):

$$CI_{.95} = (\bar{X}_1 - \bar{X}_2) \pm (t_{.05})(s_{\bar{X}_1 - \bar{X}_2}) = -5 \pm (2.31)(2.55) = -5 \pm 5.89$$

$$-10.89 \leq (\mu_1 - \mu_2) \leq .89$$


This result indicates that the researcher can be 95% confident (or the probability is .95) that
the true difference between the population means falls between -10.89 and .89.
Specifically, it indicates that one can be 95% confident (or the probability is .95) that the mean
of the population Group 2 represents is no more than 10.89 points higher than the mean of
the population that Group 1 represents, and that the mean of the population that Group 1
represents is no more than .89 points higher than the mean of the population that Group 2 represents.

Note that in using the above notation, when a confidence interval range involves both a 
negative and positive limit (as is the case in the above example), it indicates that it is possible for 
either of the two population means to be the larger value. If, on the other hand, both limits 
identified by the confidence interval are positive values, the mean of Population 1 will always 
be greater than the mean of Population 2. If both limits identified by the confidence interval are 
negative values, the mean of Population 2 will always be greater than the mean of Population 1. 

The 99% confidence interval for Example 11.1 will also be computed to illustrate that the 
range of values that define a 99% confidence interval is always larger than the range which 
defines a 95% confidence interval. 


$$CI_{.99} = (\bar{X}_1 - \bar{X}_2) \pm (t_{.01})(s_{\bar{X}_1 - \bar{X}_2}) = -5 \pm (3.36)(2.55) = -5 \pm 8.57$$

$$-13.57 \leq (\mu_1 - \mu_2) \leq 3.57$$

Thus, the researcher can be 99% confident (or the probability is .99) that the true difference
between the population means falls between -13.57 and 3.57. Specifically, it indicates
that one can be 99% confident (or the probability is .99) that the mean of the population Group
2 represents is no more than 13.57 points higher than the mean of the population that Group 1
represents, and that the mean of the population that Group 1 represents is no more than 3.57 points
higher than the mean of the population that Group 2 represents. In closing the discussion of
confidence intervals, it is worth noting that the broad range of values that define the above computed
confidence intervals will not allow a researcher to estimate with great precision the actual
difference between the means of the underlying populations. Additionally, the reader should take note
of the fact that the reliability of Equation 11.15 will be compromised if one or more of the
assumptions of the t test for two independent samples are saliently violated.
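
The two confidence intervals can be reproduced with a few lines of Python. The sketch below is illustrative only: `mean_difference_ci` is a hypothetical helper that implements Equation 11.15, and the critical t values are the tabled values for df = 8 quoted in the text rather than values computed by the code.

```python
def mean_difference_ci(mean_diff, t_crit, se_diff):
    """Equation 11.15: (lower, upper) limits for mu_1 - mu_2."""
    margin = t_crit * se_diff
    return mean_diff - margin, mean_diff + margin

mean_diff = 2.8 - 7.8     # obtained difference between the group means
se_diff = 2.55            # standard error of the difference

ci_95 = mean_difference_ci(mean_diff, 2.31, se_diff)   # tabled two-tailed .05 value, df = 8
ci_99 = mean_difference_ci(mean_diff, 3.36, se_diff)   # tabled two-tailed .01 value, df = 8

print([round(x, 2) for x in ci_95])
print([round(x, 2) for x in ci_99])
```

Since only the critical value changes between the two calls, the 99% interval necessarily contains the 95% interval, which is the point made above about the relative width of the two ranges.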


6. Test 11d: The z test for two independent samples  There are occasions (albeit infrequent)
when a researcher wants to compare the means of two independent samples, and happens to
know the variances of the two underlying populations. In such a case, the z test for two
independent samples should be employed to evaluate the data instead of the t test for two
independent samples. As is the case with the latter test, the z test for two independent
samples assumes that the two samples are randomly selected from populations that have normal
distributions. The effect of violation of the normality assumption on the test statistic decreases
as the size of the samples employed in an experiment increases. The homogeneity of variance
assumption noted for the t test for two independent samples is not an assumption of the z test
for two independent samples.

The null and alternative hypotheses employed for the z test for two independent samples
are identical to those employed for the t test for two independent samples. Equation 11.16 is
employed to compute the test statistic for the z test for two independent samples.

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \qquad \text{(Equation 11.16)}$$

The only differences between Equation 11.16 and Equation 11.1 (the equation for the t test
for two independent samples) are: a) In the denominator of Equation 11.16 the population
variances $\sigma_1^2$ and $\sigma_2^2$ are employed instead of the estimated population variances $\tilde{s}_1^2$ and $\tilde{s}_2^2$
(which are employed in Equation 11.1); and b) Equation 11.16 computes a z score which is
evaluated with the normal distribution, while Equation 11.1 derives a t score which is evaluated
with the t distribution. Unlike Equation 11.1, Equation 11.16 can be used with both equal and
unequal sample sizes.

If it is assumed that the two population variances are known in Example 11.1, and that
$\sigma_1^2 = 21.7$ and $\sigma_2^2 = 10.7$, Equation 11.16 can be employed to evaluate the data. Note that the
obtained value z = -1.96 is identical to the value that is computed for t when Equation 11.1 is
employed.

$$z = \frac{2.8 - 7.8}{\sqrt{\frac{21.7}{5} + \frac{10.7}{5}}} = -1.96$$

The obtained value z = -1.96 is evaluated with Table A1 (Table of the Normal
Distribution) in the Appendix. In Table A1 the tabled critical two-tailed .05 and .01 values are
$z_{.05} = 1.96$ and $z_{.01} = 2.58$, and the tabled critical one-tailed .05 and .01 values are
$z_{.05} = 1.65$ and $z_{.01} = 2.33$. Since the computed absolute value z = 1.96 is equal to the tabled
critical two-tailed value $z_{.05} = 1.96$, the nondirectional alternative hypothesis $H_1$: $\mu_1 \neq \mu_2$ is
supported at the .05 level. Since the computed value z = -1.96 is a negative number and the
absolute value of z is greater than the tabled critical one-tailed .05 value $z_{.05} = 1.65$, the
directional alternative hypothesis $H_1$: $\mu_1 < \mu_2$ is also supported at the .05 level.
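
As a rough check on the computation of Equation 11.16, the following Python sketch evaluates the same values. It is an illustration under stated assumptions: the function name is hypothetical, and tail probabilities are obtained from the standard normal curve with `statistics.NormalDist` in place of the comparison with tabled critical values in Table A1.

```python
import math
from statistics import NormalDist

def z_two_independent(mean1, mean2, var1, var2, n1, n2):
    """Equation 11.16: z for two independent samples with known population variances."""
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)

z = z_two_independent(2.8, 7.8, 21.7, 10.7, 5, 5)

# Tail probabilities from the standard normal curve, in place of Table A1
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
p_one_tailed = NormalDist().cdf(z)      # directional hypothesis mu_1 < mu_2

print(round(z, 2), round(p_two_tailed, 4), round(p_one_tailed, 4))
```

Note that the unrounded statistic lies very close to the two-tailed .05 critical value, so a conclusion for the nondirectional hypothesis rests on a borderline result.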

When the same set of data is evaluated with the t test for two independent samples,
although the directional alternative hypothesis $H_1$: $\mu_1 < \mu_2$ is supported at the .05 level, the
nondirectional alternative hypothesis $H_1$: $\mu_1 \neq \mu_2$ is not supported. This latter fact illustrates
that if the z test for two independent samples and the t test for two independent samples are
employed to evaluate the same set of data (except when the value of $n_1 + n_2 - 2$ is extremely
large), the latter test will provide a more conservative test of the null hypothesis (i.e., make it
more difficult to reject $H_0$). This is the case, since the tabled critical values listed for the z test
for two independent samples will always correspond to the tabled critical values listed in Table
A2 for $df = \infty$ (which are the lowest tabled critical values listed for the t distribution).

The final part of the discussion of the z test for two independent samples will describe a 
special case of the test in which it is employed to evaluate the difference between the average 
performance of two samples for whom scores have been obtained on a binomially distributed 
variable. Example 11.3, which is used to illustrate this application of the test, is an extension of 
Example 9.6 (discussed under the z test for a population proportion (Test 9a)) to a design 
involving two independent samples. 


Example 11.3 An experiment is conducted in which the performance of two groups is 
contrasted on a test of extrasensory perception. The two groups are comprised of five subjects 
who believe in extrasensory perception (Group 1) and five subjects who do not believe in it 
(Group 2). The researcher employs as test stimuli a list of 200 binary digits (specifically, the 
values 0 and 1) which have been randomly generated by a computer. During the experiment an 
associate of the researcher concentrates on each of the digits in the order it appears on the list. 
While the associate does this, each of the ten subjects, all of whom are in separate rooms, 
attempts to guess the value of the number for each of 200 trials. The number of correct guesses 
for the two groups of subjects follow: Group 1: 105, 120, 130, 115, 110; Group 2: 104, 99, 
90, 100, 107. Is there a difference in the performance of the two groups? 


The null and alternative hypotheses evaluated in Example 11.3 are identical to those
evaluated in Example 11.1. Example 11.3 is evaluated with Equation 11.17, which is the form
Equation 11.16 assumes when $\sigma_1 = \sigma_2$. Note that in Equation 11.17, $m_j$ is employed to
represent the number of subjects in the jth group, since the notation n is employed with a
binomially distributed variable to designate the number of trials on which each subject is tested.
Thus, $m_1 = m_2 = 5$.

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sigma\sqrt{\frac{1}{m_1} + \frac{1}{m_2}}} \qquad \text{(Equation 11.17)}$$


In employing Equation 11.17, we first compute the average score of each of the groups:
$\bar{X}_1 = 580/5 = 116$ and $\bar{X}_2 = 500/5 = 100$. Since scores on the binary guessing task described
in Example 11.3 are assumed to be binomially distributed, as is the case in Example 9.6, the
following is true: n = 200, $\pi_1 = .5$, and $\pi_2 = .5$. The computed value for the population
standard deviation for the binomially distributed variable is $\sigma = \sqrt{n\pi_1\pi_2} = \sqrt{(200)(.5)(.5)} =
7.07$. (The computation of the latter values is discussed in Section I of the binomial sign test for
a single sample (Test 9).) When the appropriate values are substituted in Equation 11.17, the
value z = 3.58 is computed.

$$z = \frac{116 - 100}{7.07\sqrt{\frac{1}{5} + \frac{1}{5}}} = 3.58$$

Since the computed absolute value z = 3.58 is greater than the tabled critical two-tailed
values $z_{.05} = 1.96$ and $z_{.01} = 2.58$, the nondirectional alternative hypothesis $H_1$: $\mu_1 \neq \mu_2$ is
supported at both the .05 and .01 levels. Since the computed value z = 3.58 is a positive number
that is greater than the tabled critical one-tailed values $z_{.05} = 1.65$ and $z_{.01} = 2.33$, the
directional alternative hypothesis $H_1$: $\mu_1 > \mu_2$ is supported at both the .05 and .01 levels. Thus, it
can be concluded that the average score of Group 1 is significantly larger than the average score
of Group 2.
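
The analysis of Example 11.3 can be traced from the raw scores with the Python sketch below. It is offered only as an illustration of Equation 11.17; the variable names are hypothetical, and the binomial parameters are those assumed in the text (200 trials with a probability of .5 for each outcome).

```python
import math

group1 = [105, 120, 130, 115, 110]   # subjects who believe in extrasensory perception
group2 = [104, 99, 90, 100, 107]     # subjects who do not believe in it
n_trials, pi1, pi2 = 200, 0.5, 0.5   # binomial parameters for the guessing task

mean1 = sum(group1) / len(group1)
mean2 = sum(group2) / len(group2)
sigma = math.sqrt(n_trials * pi1 * pi2)    # population standard deviation

# Equation 11.17, with m1 and m2 the numbers of subjects in the two groups
m1, m2 = len(group1), len(group2)
z = (mean1 - mean2) / (sigma * math.sqrt(1 / m1 + 1 / m2))

print(mean1, mean2, round(sigma, 2), round(z, 2))
```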

After employing Equation 11.17 to evaluate an experiment such as the one described in
Example 11.3, a researcher may want to determine whether either of the group averages is above
or below the expected mean value (which, in Example 11.3, is $\mu = n\pi_1 = (200)(.5) = 100$).
Equation 9.10 is employed to evaluate the performance of a single group. The null hypothesis
that is employed for evaluating a single group is $H_0$: $\pi_1 = .5$ (which, for Example 11.3, is
commensurate with $H_0$: $\mu_1 = 100$), and the nondirectional alternative hypothesis is $H_1$: $\pi_1 \neq .5$
(which, for Example 11.3, is commensurate with $H_1$: $\mu_1 \neq 100$). Since it is obvious from
inspection of the data that the performance of Group 2 is at chance expectancy, the performance
of Group 1, which is above chance, will be evaluated with Equation 9.10.

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{m}} = \frac{116 - 100}{7.07 / \sqrt{5}} = 5.06$$

Since the computed value z = 5.06 is greater than the tabled critical two-tailed values
$z_{.05} = 1.96$ and $z_{.01} = 2.58$, the nondirectional alternative hypothesis $H_1$: $\pi_1 \neq .5$ is supported
at both the .05 and .01 levels. Since the computed value z = 5.06 is a positive number that is
greater than the tabled critical one-tailed values $z_{.05} = 1.65$ and $z_{.01} = 2.33$, the directional
alternative hypothesis $H_1$: $\pi_1 > .5$ is also supported at both the .05 and .01 levels. Thus, it can
be concluded that the average performance of Group 1 is significantly above chance.
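
The single-group evaluation can be sketched in the same manner. The function below is a hypothetical helper that expresses the form Equation 9.10 takes for this example (a group mean compared with the chance value, using the known binomial standard deviation); applying it to Group 2 as well makes the statement that Group 2 is at chance expectancy explicit.

```python
import math

def z_single_group(mean, mu, sigma, m):
    """z for one group's mean against the chance value, with m subjects in the group."""
    return (mean - mu) / (sigma / math.sqrt(m))

n_trials, pi1 = 200, 0.5
mu = n_trials * pi1                               # expected number correct under chance
sigma = math.sqrt(n_trials * pi1 * (1 - pi1))     # binomial standard deviation

z_group1 = z_single_group(116, mu, sigma, 5)
z_group2 = z_single_group(100, mu, sigma, 5)

print(round(z_group1, 2), round(z_group2, 2))
```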


VII. Additional Discussion of the t Test for Two Independent 
Samples 


1. Unequal sample sizes  In Section IV it is noted that if Equation 11.1/11.2 (which can only
be used when $n_1 = n_2$) is applied to data where $n_1 \neq n_2$, the absolute value of t will be larger
than the value computed with Equation 11.3. To illustrate this point, Equations 11.1 and 11.3
will be applied to a modified form of Example 11.1. Specifically, one of the scores in Group 2
will be eliminated from the data. If the score of Subject 1 (i.e., 11) is eliminated, the four scores
that remain are: 11, 5, 8, 4. Employing the latter values, $n_2 = 4$, $\Sigma X_2 = 28$, $\Sigma X_2^2 = 226$,
$\bar{X}_2 = 7$, and $\tilde{s}_2^2 = [226 - (28)^2/4]/(4 - 1) = 10$. For Group 1 the values $n_1 = 5$, $\bar{X}_1 = 2.8$, and
$\tilde{s}_1^2 = 21.7$ remain unchanged. The relevant values are substituted in Equation 11.1.

$$t = \frac{2.8 - 7}{\sqrt{\frac{21.7}{5} + \frac{10}{4}}} = -1.61$$

The same information is now substituted in Equation 11.3.

$$t = \frac{2.8 - 7}{\sqrt{\left[\frac{(5 - 1)(21.7) + (4 - 1)(10)}{5 + 4 - 2}\right]\left[\frac{1}{5} + \frac{1}{4}\right]}} = -1.53$$





Note that the absolute value t = 1.61 computed with Equation 11.1 is larger than the
absolute value t = 1.53 computed with Equation 11.3, and thus Equation 11.3 provides a more
conservative test of the null hypothesis. It so happens that in this instance neither of the
computed t values allows the null hypothesis to be rejected since, for df = 5 + 4 - 2 = 7,
both absolute values are below the tabled critical two-tailed values $t_{.05} = 2.37$ and $t_{.01} = 3.50$,
and the tabled critical one-tailed values $t_{.05} = 1.90$ and $t_{.01} = 3.00$.
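
The two computations above can be checked with a short Python sketch. The function names `t_unpooled` and `t_pooled` are illustrative labels chosen here (they are not taken from the text); the first follows the form of Equation 11.1 and the second the form of Equation 11.3, using the summary values of the modified Example 11.1.

```python
import math


def t_unpooled(mean1, var1, n1, mean2, var2, n2):
    """Equation 11.1 form: each variance estimate is divided by its own n."""
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)


def t_pooled(mean1, var1, n1, mean2, var2, n2):
    """Equation 11.3 form: variances are pooled, weighted by degrees of freedom."""
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


# Summary values from the modified Example 11.1 (Group 2 reduced to four scores).
values = dict(mean1=2.8, var1=21.7, n1=5, mean2=7.0, var2=10.0, n2=4)

t_11_1 = t_unpooled(**values)
t_11_3 = t_pooled(**values)
print(round(t_11_1, 2), round(t_11_3, 2))  # -1.61 -1.53
```

With unequal sample sizes the pooled form gives the smaller absolute value here, which is the sense in which Equation 11.3 is described as the more conservative test.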


2. Robustness of the t test for two independent samples Some statisticians believe that if
one or more of the assumptions of a parametric test (such as the t test for two independent
samples) are saliently violated, the test results will be unreliable, and because of this under such
conditions it is more prudent to employ the analogous nonparametric test, which will generally
have fewer or less rigorous assumptions than its parametric analog. In the case of the t test for
two independent samples the most commonly employed analogous nonparametric tests are the
Mann-Whitney U test (Test 12) and the chi-square test for r x c tables (Test 16). Use of the
Mann-Whitney U test (which is most likely to be recommended when the normality assumption
of the t test for two independent samples is saliently violated) requires that the original
interval/ratio scores be transformed into a rank-order format. By virtue of rank-ordering the data,
information is sacrificed (since rank-orderings do not provide information regarding the
magnitude of the differences between adjacent ranks). Given the fact that the Mann-Whitney U test
employs less information, many researchers if given the choice will still elect to employ the t test
for two independent samples, even if there is reason to believe that the normality assumption
of the latter test is violated. Under such conditions, however, most researchers would probably
conduct a more conservative t test in order to avoid inflating the likelihood of committing a Type
I error (i.e., one might employ the tabled critical $t_{.01}$ value to represent the $t_{.05}$ value instead of
the actual value listed for $t_{.05}$). In the unlikely event that a researcher elects to employ the
chi-square test for r x c tables in place of the t test for two independent samples, he must
convert the original interval/ratio data into a categorical format. This latter type of
transformation will result in an even greater loss of information than is the case with the Mann-Whitney
U test.

The justification for using a parametric test in lieu of its nonparametric analog, even when
one or more of the assumptions of the former test are violated, is that the results of numerous
empirical sampling studies have demonstrated that under most conditions a parametric test like
the t test for two independent samples is reasonably robust. A robust test is one that still
provides reliable information about the underlying sampling distribution, in spite of the fact that one
or more of the test's assumptions have been violated. In addition, researchers who are reluctant
to employ nonparametric tests argue that parametric tests, such as the t test for two independent
samples, are more powerful than their nonparametric analogs. Proponents of nonparametric tests
counter with the argument that the latter group of tests are almost equivalent in power to their
parametric analogs, and because of this, they state that it is preferable to use the appropriate
nonparametric test if any of the assumptions of a parametric test have been saliently violated.
Throughout this book it is demonstrated that in most instances when the same set of data is
evaluated with both a parametric and a nonparametric test (especially a nonparametric test
employing rank-order data), the two tests yield comparable results. As a general rule, in
instances where only one of the two tests is significant, the parametric test is the one that is more
likely to be significant. However, in most cases where a parametric test achieves significance
and the nonparametric test does not, the latter test will fall just short of being significant. In
instances where both tests are significant, the alpha level at which the result is significant will
generally be lower for the parametric test.


3. Outliers (Test 11e: Procedures for identifying outliers) and data transformation An 
outlier is an observation (or subset of observations) in a set of data that does not appear to be 
consistent with the rest of the data. In most instances inconsistency is reflected in the magnitude 
of an observation (i.e., it is either much higher or much lower than any of the other observations).
Yet what appears to be an inconsistent/extreme score to one observer may not appear to be so 
to another. Barnett and Lewis (1994) emphasize that a defining characteristic of an outlier is that 
it elicits genuine surprise in an observer. To illustrate the fact that what may surprise one 
observer may not surprise another, we will consider an example cited by Barnett and Lewis 
(1994, p. 15). The latter authors present data described by Fisher, Corbet, and Williams (1943), 
which represents the number of moths of a specific species that were caught in light-traps 
mounted in a geographical locale in England. The following 15 observations were obtained. 


3, 3, 4, 5, 7, 11, 12, 15, 18, 24, 51, 54, 84, 120, 560 


Barnett and Lewis (1994) point out that although the value 560 might appear to be an 
observation that would surprise most observers, in point of fact, it is not an anomaly. The reason 
why 560 would not be classified as an outlier is because an experienced entomologist would be 
privy to the fact that the distribution under study is characterized by a marked skewness, and 
consequently an occasional extreme score in the upper tail such as the value 560 is a matter-of- 
fact occurrence. Thus, a researcher familiar with the phenomenon under study would not classify 
560 as an outlier.

If at all possible, one should determine the source of any observation in a set of data that 
is viewed as anomalous, since the presence of one or more outliers can dramatically influence 
the values of both the mean and variance of a distribution. As a result of the latter, any test 
statistic computed for the data can be unreliable. As an example, assume that a researcher is 
comparing two groups with respect to their scores on a dependent variable, and that all the 
subjects in both groups except for one subject in Group 2 obtain a score between 0 and 20. The 
subject in Group 2 with an outlier score obtains a score of 200. It should be obvious that the 
presence of this one score (even if the size of the sample for Group 2 is relatively large) will 
inflate the mean and variance of Group 2 relative to that of Group 1, and because of this either 
one or both of the following consequences may result: a) A significant difference between the 
two group means may be obtained which would not have been obtained if the outlier score was 
not present in the data; and/or b) The homogeneity of variance assumption will be violated due 
to the higher estimated population variance computed for Group 2. By virtue of adjusting the 
t test statistic for violation of the homogeneity of variance assumption, a more conservative test 
will be conducted, thus making it more difficult to reject the null hypothesis.

Stevens (1996) notes that there are basically two strategies that can be used in dealing with
outliers. One strategy is to develop and employ procedures for identifying outliers. Within the 
framework of the latter strategy, criteria should be established for determining under what 
conditions one or more scores that are identified as outliers should be deleted from a set of data. 
A second approach in dealing with outliers is to develop statistical procedures that are not 
influenced (or only minimally affected) by the presence of outliers. Such procedures are com- 
monly called robust statistical procedures — the term robust, as noted earlier, referring to 
procedures which are not overly dependent on critical assumptions regarding an underlying 
population distribution. The discussion of outliers within the framework of robustness is predi- 
cated on the fact that their presence may lead to violation of one or more assumptions underlying 
a statistical test. In the case of the t test for two independent samples, the assumptions of 
concern are those of normality and homogeneity of variance. Although a number of inferential
statistical tests have been developed for identifying outliers, Sprent (1993, 1998) notes that 
ironically many of these tests lack robustness, and because of the latter may lack power with 
respect to their ability to identify the presence of one or more outliers. Sprent (1998) and Barnett 
and Lewis (1994) discuss the masking effect, which refers to the fact that a test’s power in 
identifying a specific outlier may be compromised if there are one or more additional outliers in 
the data. 

Barnett and Lewis (1994), who represent the most comprehensive source on the subject, 
describe a large number of tests for identifying outliers — they describe 48 tests alone for 
detection of outliers in data that are assumed to be drawn from a normal distribution. Some of 
the tests for identifying outliers are designed to identify a single outlier, some are designed for 
identification of multiple outliers, and some tests are specific with respect to identifying outliers 
in one or both tails of a distribution. Additionally, tests are described for detecting outliers in 
data that are assumed to be drawn from any number of a variety of nonnormal distributions (e.g., 
binomial, Poisson, gamma, exponential, etc.). Given the large number of tests that are available 
for detecting outliers, it is not unusual that two or more tests applied to the same set of data may 
not agree with one another with regard to whether a specific observation should be classified as 
an outlier. 


Test 11e: Procedures for identifying outliers At this point we will consider some informal
criteria for identifying outliers. Stevens (1996, p. 17) notes that it has been demonstrated that
regardless of the distribution of data, at least $[1 - (1/k^2)](100)$ percent of the observations in
a set of data must fall within k standard deviations of the mean (where k is any value greater than
1). Applying the latter, we determine that when k = 2, at least 75% of the observations will fall
within +2 and -2 standard deviations from the mean. The percentages for when k is equal to 3,
4, and 5 are respectively, 88.89%, 93.75%, and 96%. These values suggest that scores which are
beyond two standard deviations from the mean are rare; that scores beyond three standard
deviations are even rarer; and so on. Thus, scores that yield relatively high standard deviation
values should be considered as possible outliers.
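
The percentages quoted above follow directly from the expression $[1 - (1/k^2)](100)$. The sketch below simply evaluates it for k = 2 through 5; the function name `min_percent_within` is a label chosen for this illustration.

```python
def min_percent_within(k):
    """Lower bound on the percent of scores within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return (1 - 1 / k**2) * 100


for k in (2, 3, 4, 5):
    print(k, round(min_percent_within(k), 2))
# 2 75.0
# 3 88.89
# 4 93.75
# 5 96.0
```

Because the bound holds regardless of the shape of the distribution, it is a minimum; for a normal distribution the actual percentages are considerably higher.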

Sprent (1993, 1998) discusses a procedure for identifying outliers which he describes as 
being relatively robust. The procedure employs Equation 11.18 to determine whether or not a 
score in a sample of n observations should be classified as an outlier. 


$$\frac{|X_i - M|}{MAD} > Max \qquad \text{(Equation 11.18)}$$


Where: $X_i$ represents any of the n scores being evaluated with respect to whether it is an outlier
M is the median of the n scores in the sample
MAD is the median absolute deviation
Max is the critical value the result to the left of the inequality must exceed in order to
conclude the value $X_i$ is an outlier


In the left side of Equation 11.18, the numerator value $|X_i - M|$ represents a difference
score. The denominator value MAD represents a measure of dispersion. The equation is
designed to yield a standard deviation score such as that obtained by Equation I.27 (which is the
equation for computing a standard deviation score for a normally distributed variable). The
value Max on the right side of Equation 11.18 represents an extreme standard deviation score that
is employed as a criterion for classifying an outlier.

The value of MAD in Equation 11.18 is determined as follows: a) Upon computing the 
sample median, obtain an absolute deviation score from the sample median for each of the n 
scores. The latter is done by computing the absolute value of the difference between the median 
and each score; b) When all n absolute deviation scores have been computed, arrange them 
ordinally (i.e., from lowest to highest); and c) Find the median of the n absolute deviation scores. 
The latter value represents the value of MAD to employ in Equation 11.18. 

To determine whether any of the n scores in the sample is an outlier, the following protocol
is employed: a) Select the score that deviates by the greatest amount from the median to initially
represent the value of $X_i$; b) Subtract the median from the value of $X_i$, and divide the difference
(which will always be a positive value since it is an absolute value) by MAD; c) If the value
obtained in b) is greater than Max, one can conclude that $X_i$ is an outlier. If the value obtained
in b) is equal to or less than Max, one cannot conclude that $X_i$ is an outlier; d) If it is concluded
that the score selected to represent $X_i$ is not an outlier, terminate the analysis. If it is concluded
that the score selected to represent $X_i$ is an outlier, select the score that has the second greatest
deviation from the median and repeat steps b) and c); and e) Continue substituting $X_i$ values in
Equation 11.18 until an $X_i$ value is identified that is not an outlier. The reader should take note
of the fact that in most cases it is assumed that few if any observations within a given sample will
be identified as outliers.

Sprent (1993, p. 278) notes that the selection of the value to represent Max in Equation
11.18 is somewhat arbitrary. He recommends that 5 is a reasonable value to employ, since if one
assumes the data are derived from an approximately normally distributed population, the value 
Max = 5 will be extremely likely to identify scores that deviate from the mean by more than three 
standard deviations. (From Column 3 of Table A1 we can determine that the likelihood of a 
score being greater or less than three standard deviations from the mean is .0026.) It should be 
noted that if there is reason to believe that the sample data are derived from a population that is 
not normally distributed, the choice of what value to employ to represent Max becomes more 
problematical. The reader should consult sources which discuss outliers in greater detail for a 
more in-depth discussion of the latter issue (e.g., Barnett and Lewis (1994) and Sprent (1993, 
1998)). 

To illustrate the application of Equation 11.18, assume we have a sample consisting of the
following five scores: 2, 3, 4, 7, 18. Following the protocol described above: a) We determine
that the median (i.e., the middle score) of the sample is 4; b) We compute the absolute deviation
of each score from the median: |2 - 4| = 2, |3 - 4| = 1, |4 - 4| = 0, |7 - 4| = 3, |18 - 4| = 14.
Arranging the five deviation scores ordinally (0, 1, 2, 3, 14), we determine the median of the five
deviation scores is 2. The latter value will represent MAD in Equation 11.18; c) Since the only
value we would suspect to be an outlier is the score of 18, we employ that value to represent $X_i$
in Equation 11.18; d) Since we will assume the data are derived from a normal distribution, we
employ the value Max = 5; e) Substituting the appropriate values in the left side of Equation
11.18, we compute |18 - 4|/2 = 7. Since the obtained value of 7 is greater than Max = 5, we
conclude that the score of 18 is an outlier.
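
The protocol built around Equation 11.18 can be sketched in Python as follows. The function name `sprent_outliers` and the parameter `max_ratio` (standing in for Max) are names chosen for this illustration, and the guard against a MAD of zero is an added assumption, since the text does not discuss that case.

```python
from statistics import median


def sprent_outliers(scores, max_ratio=5.0):
    """Flag scores whose |X_i - M| / MAD exceeds max_ratio (Equation 11.18)."""
    m = median(scores)
    mad = median([abs(x - m) for x in scores])
    if mad == 0:
        # Assumption: the ratio is undefined when MAD is zero, so stop here.
        raise ValueError("MAD is zero; Equation 11.18 cannot be evaluated")
    flagged = []
    # Evaluate scores from the largest deviation downward and stop at the
    # first score that does not exceed the criterion (steps d and e).
    for x in sorted(scores, key=lambda s: abs(s - m), reverse=True):
        if abs(x - m) / mad > max_ratio:
            flagged.append(x)
        else:
            break
    return m, mad, flagged


m, mad, flagged = sprent_outliers([2, 3, 4, 7, 18])
print(m, mad, flagged)  # 4 2 [18]
```

For the five scores in the example the sketch reproduces the median of 4, the MAD of 2, and the single flagged score of 18 (ratio 14/2 = 7).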

When another simple test for an outlier described in Barnett and Lewis (1994, p. 247) and 
Tietjen (1986, pp. 500-502) is applied to the same set of data, it also yields a significant result
with respect to the value 18 being an outlier. The latter test, developed by Grubbs (1969), 
employs Equation 11.19. 


$$T_n = \frac{|X_i - \bar{X}|}{\tilde{s}} \qquad \text{(Equation 11.19)}$$


The computed test statistic $T_n$ is referred to as the extreme Studentized residual. Use of
Equation 11.19 to compute $T_n$ requires that the sample mean (which is based on all of the scores
in the sample including the suspected outlier) be subtracted from the value of a suspected outlier
($X_i$). The resulting difference is divided by the value computed for $\tilde{s}$ (which, like the mean, is
based on all the scores in the sample including the suspected outlier) with Equation I.8. The
value computed for $T_n$ is evaluated with special tables found in sources that describe the test.
Employing Equation 11.19 with our data, $T_n$ = |18 - 6.8|/6.53 = 1.72, which it turns out is
significant at the .05 level. In spite of the fact that in the case of our example two inferential
tests result in the conclusion that the observation 18 is an outlier, it is important to remember that 
it is not unusual for two or more tests for detecting an outlier to disagree with one another. In 
view of the latter, the reader is advised to consult more detailed sources (such as Barnett and 
Lewis (1994)) to determine the most appropriate test for detecting outliers in a given set of data. 
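
As a rough check on the arithmetic of Equation 11.19, the sketch below computes the extreme Studentized residual for the same five scores. It assumes that the standard deviation referred to in the text is the estimate with n - 1 in the denominator, which is what Python's `statistics.stdev` computes and which reproduces the value 6.53. The function name is illustrative, and the critical values against which $T_n$ is judged must still be taken from the special tables mentioned above.

```python
from statistics import mean, stdev


def extreme_studentized_residual(scores, suspect):
    """T_n = |X_i - mean| / s, with mean and s based on all scores (Equation 11.19)."""
    return abs(suspect - mean(scores)) / stdev(scores)


scores = [2, 3, 4, 7, 18]
sample_mean = mean(scores)
sample_sd = stdev(scores)
t_n = extreme_studentized_residual(scores, 18)
print(sample_mean, round(sample_sd, 2), t_n)
```

Because the text rounds the standard deviation to 6.53 before dividing, the unrounded result from the sketch may differ from 1.72 in the second decimal place.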

It should be noted that employing the magnitude of a standard deviation score (i.e., a z
score) as a criterion for classifying outliers can often lead to misleading conclusions. Shiffler
(1988) has demonstrated that the largest possible absolute z value that can occur in a set of data
is defined by the limit $(n - 1)/\sqrt{n}$. For example, if n = 15, the value of z cannot exceed
$(15 - 1)/\sqrt{15} = 3.62$. Inspection of the equation $(n - 1)/\sqrt{n}$ reveals that the maximum
possible absolute value z may attain is a direct function of the sample size; in other words, the
larger the sample size, the larger the limiting value of z. In order to appreciate how the use of
a z value with small sample sizes can lead to misleading conclusions if it is employed as a
criterion for classifying outliers, consider the following examples presented by Shiffler (1988):
a) A set of data consists of the following five scores: 0, 0, 0, 0, 1 million. On inspection, some
researchers might immediately view the score of 1 million as an outlier. Yet when the sample
mean and estimated population standard deviation are computed for the five scores, the z score
associated with the score of 1 million is only z = 1.78; and b) A set of data is comprised of 18
scores, with 17 of the scores equal to 0 and the remaining score equal to 1. When the sample
mean and estimated population standard deviation are computed for the 18 scores, the z score
associated with the score of 1 is z = 4.007. The magnitude of the latter z value might suggest to
some researchers that the score of 1 is an outlier. Thus, as noted above, when the sample size
is relatively small, the theoretical limits imposed on the value of a z score may make it
impractical to employ it as a criterion for classifying outliers.
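
The limit described by Shiffler (1988) can be examined numerically. The sketch below computes $(n - 1)/\sqrt{n}$ and the z score of the extreme value in each of the two data sets described above, using the estimated population standard deviation (n - 1 in the denominator). The function names `max_abs_z` and `z_of` are labels chosen for this illustration.

```python
import math
from statistics import mean, stdev


def max_abs_z(n):
    """Largest absolute z score attainable in a sample of size n."""
    return (n - 1) / math.sqrt(n)


def z_of(score, scores):
    """z score of one value, using the sample mean and the n - 1 standard deviation."""
    return (score - mean(scores)) / stdev(scores)


example_a = [0, 0, 0, 0, 1_000_000]  # four zeros and one extreme score
example_b = [0] * 17 + [1]           # seventeen zeros and a single 1

z_a = z_of(1_000_000, example_a)
z_b = z_of(1, example_b)
print(z_a, max_abs_z(5))
print(z_b, max_abs_z(18))
```

In both data sets every score but one is identical, so the z score of the remaining value sits exactly at the limit for that sample size. This is why the score of 1 million cannot produce a large z with n = 5, while a score of 1 among seventeen zeros produces a z of about 4.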

Tabachnick and Fidell (1989, 1996) note that outliers present a researcher with a statistical 
problem if one elects to include them in the data analysis, and a moral problem if one is trying 
to decide whether or not to eliminate them from the analysis. In essence, how a researcher deals 
with outliers should depend on how one accounts for their presence. In instances where a re- 
searcher is reasonably sure that an outlier is due to any one of the following, Tabachnick and 
Fidell (1989, 1996) state that one has a strong argument for dropping such a score from the data: 
a) There is reason to believe that an error was made in recording the score in question (either 
human recording error or instrumentation error); b) There is reason to believe that the score is the 
result of failure on the part of a subject to follow instructions, or other behavior on the part of the
subject indicating a lack of cooperation and/or attention to the experiment; and c) There is reason
to believe that the score is the result of failure on the part of the experimenter to employ the 
correct protocol in obtaining data from a subject. 

One disadvantage of removing outliers from data is that it reduces the sample size, and the 
latter can result in a decrease in statistical power. Of course, if one is conducting a parametric 
test such as a t test, one may counteract the aforementioned loss in power by virtue of the fact
that removal of an outlier will decrease the variability in the data. Tietjen (1986) notes that 
rejecting an outlier on a purely statistical basis may only be an indication that the data are derived 
from a population other than the one the researcher assumes to be the parent population. An 
example cited by Kruskal (1960) is employed by Tietjen (1986) to demonstrate that whether or 
not an observation should be considered an outlier will be a function of the hypothesis under 
study. The example used involves a study in which the concentration of a specific chemical in 
a mixture is being evaluated. Measures of the chemical are derived from five different samples, 
but one of the measures is way out of line in relation to the others. Let us assume that the latter 
measure is the result of instrumentation error. If the purpose of the study is to estimate the 
concentration of the chemical in the mixture, the researcher may elect to label the atypical 
measure an outlier, and only use the remaining four measures to compute an average value. If 
instead, the purpose of the study is to assess the reliability of the instrument that is employed to 
measure the concentration of the chemical, the outlier observation is relevant and should be 
retained. Kruskal (1960) emphasizes the fact that the presence of one or more outliers in a set 
of data may indicate something that is of practical or theoretical relevance, and because of the 
latter, he recommends that it should be standard protocol for researchers to report the presence 
of any outliers, regardless of whether or not they elect to employ them in the analysis of the data. 

In the final analysis, excluding data that are deemed to be outliers is an extreme measure
which entails the risk of mistakenly eliminating valid information about an underlying
population. At the other extreme, a researcher can include in the analysis all of the data, including
observations that are viewed as outliers. Obviously, the latter strategy increases the likelihood
that the sample will be contaminated, and therefore what may emerge from the analysis may be
a gross distortion of what is true regarding the underlying population under study.

When a researcher has reservations about employing either extreme — removal of outliers 
versus their inclusion — an alternative strategy known as accommodation may be employed. 
The latter involves the use of a procedure which utilizes all the data, but at the same time mini- 
mizes the influence of outliers. Two obvious options within the framework of accommodation 
that reduce the impact of outliers are: a) Use of the median in lieu of the mean as a measure of 
central tendency; b) Employing an inferential statistical test that uses rank-orders instead of 
interval/ratio data. 

Accommodation is often described within the context of employing a robust statistical 
procedure (e.g., a procedure that assigns weights to the different observations when calculating 
the sample mean). Two commonly used methods for dealing with outliers (which are sometimes 
discussed within the context of accommodation) are trimming and Winsorization. 


Trimming data Trimming involves removing a fixed percentage of extreme scores from each 
of the tails of any of the distributions that are involved in the data analysis. As an example, in 
an experiment involving two groups, one might decide to omit the two highest and two lowest 
scores in each group (since by doing this, any outliers in either group would be eliminated). Sprent 
(1993) notes that common trimming levels are the top and bottom deciles (i.e., the extreme 10% 
from each tail of the distribution) and the first and fourth quartiles (i.e., the extreme 25% from 
each tail). The latter trimmed mean (i.e., the mean computed by only taking into account scores 
that fall in the middle 50% of the distribution) is often referred to as the interquartile mean. In
addition to using trimming for reducing the impact of outliers, it is also employed when a
distribution has heavy or long tails (i.e., a relatively flat distribution with a disproportionate number
of observations falling in the tails). 


Winsorization Sprent (1993, p. 69) notes that the rationale underlying Winsorization is that 
the outliers may provide some useful information concerning the magnitude of scores in the 
distribution, but at the same time may unduly influence the results of the analysis unless some 
adjustment is made. Winsorization involves replacing a fixed number of extreme scores with
the score that is closest to them in the tail of the distribution in which they occur. As an example, 
in the distribution 0, 1, 18, 19, 23, 26, 26, 28, 33, 35, 98, 654 (which has a mean value of 80.08), 
one can substitute a score of 18 for both the 0 and 1 (which are the two lowest scores), and a 
score of 35 for the 98 and 654 (which are the two highest scores). Thus, the Winsorized 
distribution will be: 18, 18, 18, 19, 23, 26, 26, 28, 33, 35, 35, 35. The mean of the Winsorized 
distribution will be 26.17. 
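
Symmetric trimming and Winsorization of a fixed number of scores in each tail can be sketched as follows. The function names `trim` and `winsorize` and the parameter `k` (the number of scores affected in each tail) are chosen for this illustration; the data are the twelve scores from the example above.

```python
from statistics import mean


def trim(scores, k):
    """Drop the k lowest and k highest scores."""
    ordered = sorted(scores)
    return ordered[k:len(ordered) - k]


def winsorize(scores, k):
    """Replace the k lowest and k highest scores with the nearest retained score."""
    ordered = sorted(scores)
    low, high = ordered[k], ordered[-k - 1]
    return [low] * k + ordered[k:len(ordered) - k] + [high] * k


scores = [0, 1, 18, 19, 23, 26, 26, 28, 33, 35, 98, 654]

original_mean = mean(scores)
winsorized = winsorize(scores, 2)
trimmed = trim(scores, 2)

print(round(original_mean, 2))     # 80.08
print(winsorized)                  # [18, 18, 18, 19, 23, 26, 26, 28, 33, 35, 35, 35]
print(round(mean(winsorized), 2))  # 26.17
print(round(mean(trimmed), 2))     # 26.0
```

Both adjustments pull the mean from 80.08 down to roughly 26. Winsorization retains all twelve observations, whereas trimming reduces the sample to eight.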

Barnett and Lewis (1994) note that the critical problem associated with trimming and 
Winsorization is selecting the number of scores that are trimmed or Winsorized. Assume that 
r represents the number of scores to be trimmed or Winsorized in the right tail, and that l
represents the number of scores to be trimmed or Winsorized in the left tail. If r = l, the
trimming or Winsorization process is considered symmetric. If r ≠ l, the trimming or
Winsorization process is considered asymmetric. The issue of whether a researcher believes one 
or both tails of a sample distribution may be contaminated is one of a number of considerations 
that have to be taken into account in determining the most appropriate procedure to use. In 
Chapter 5 of their book, Barnett and Lewis (1994) address such issues, as well as modified 
trimming and Winsorization procedures. 


Data transformation Data transformation involves performing a mathematical operation on 
each of the scores in a set of data, and thereby converting the data into a new set of scores which 
are then employed to analyze the results of an experiment. In addition to being able to reduce 
the impact of outliers, a data transformation can be employed to equate heterogeneous group 
variances, as well as to normalize a nonnormal distribution. In point of fact, a transformation 
which results in homogeneity of variance, at the same time often normalizes data. Kirk (1995) 
and Winer et al. (1991) note that another reason for employing a data transformation is to ensure
that certain factorial designs are based on an additive model. A discussion of factorial designs 
and the concept of additivity can be found in the between-subjects factorial analysis of 
variance (Test 27). 

It is not uncommon that, as a result of a data transformation, data that will not yield a 
significant effect may be modified so as to be significant (or vice versa). Because of the latter, 
one might view a data transformation as little more than a convenient mechanism for “cooking” 
data until it allows a researcher to achieve a specific goal. While the latter may be true, when 
used judiciously, data transformation can be a valuable tool. One should consider the fact that 
the selection of the unit of measurement for a dependent variable will always be somewhat 
arbitrary. Howell (1992) cites numerous examples of transformations which are employed within 
various experimental settings, because of the practical or theoretical advantage they provide. 
Illustrative of such transformations are the use of decibels in measuring sound intensity and the 
Richter scale in measuring the magnitude of energy for an earthquake (both of which are based 
on logarithmic data transformations). Another example of a data transformation is a set of test 
scores which are converted into percentiles rather than reporting the original raw scores (i.e., 
number of items correct) for subjects. 

Although, as noted earlier, a data transformation can be misused to distort data to support
a hypothesis favored by an experimenter, when employed judiciously, it can be of value.
Specifically, under certain circumstances it can allow a researcher to provide a more accurate picture
of the populations under study than will the analysis of untransformed data. Among those
sources that describe data transformations in varying degrees of detail are Howell (1992), Kirk
(1995), Myers and Well (1995), Tabachnick and Fidell (1989, 1996), Thöni (1967), and Winer
et al. (1991). Articles by Games (1983, 1984) and Levine and Dunlap (1982, 1983) address
some of the controversial issues surrounding data transformation.

Among the most commonly employed data transformations, all of which can reduce the 
impact of outliers (as well as normalize skewed data and/or produce homogeneity of variance), 
are converting scores into their square root, logarithm (a general discussion of logarithms can 
be found in Endnote 5 in the Introduction), reciprocal, and arcsine. In employing data trans- 
formations, a researcher may find it necessary to compare one or more different transformation 
procedures until he finds the one which best accomplishes the goal he is trying to achieve. 
Common sense would suggest that a data transformation procedure can be selected on the basis
of the conditions under which it has previously been demonstrated to be successful. In addition to the latter, Kirk
(1995) recommends the following protocol for selecting a data transformation procedure: a) 
Apply each of the available transformation procedures to the largest and smallest scores in each 
of the experimental treatments/groups; b) Determine the range of values within each treatment, 
and compute within each treatment the ratio of the largest to the smallest value; and c) Employ 
the transformation procedure that yields the smallest ratio. 

The most commonly employed data transformation procedures will now be discussed and 
demonstrated. 

Square-root transformation A square-root transformation may be useful when the 
mean is proportional to the variance (i.e., the proportion between the mean of a treatment and the 
variance of a treatment is approximately the same for all of the treatments). Under such circum- 
stances the square-root transformation can be effective in normalizing distributions that have a 
moderate positive skew, as well as making the treatment variances more homogeneous. This is
the case since a square-root scale will reduce the magnitude of difference between the two tails
of a positively skewed distribution by pulling the right side of the distribution in toward the
middle. Reaction time is a good example of a measure in psychology that characteristically
exhibits a strong positive skew (i.e., there are many fast or relatively fast reactors but there are
also a few slow reactors). Consequently a square-root transformation may be able to normalize a
set of reaction time data. (If the square-root transformation is not successful, a logarithmic or
reciprocal transformation discussed below may be more suitable for achieving this goal.) In such
a case the square-root transformation can reduce skewness and stabilize distributional variance.
Data taken from a Poisson distribution (discussed in Section IX (the Addendum) of the
binomial sign test for a single sample) are sometimes effectively normalized with a square-root
transformation. Such data typically consist of frequencies of randomly occurring objects or 
events that have a small probability of occurring over many trials or over a long period of time. 
In Poisson distributed data, the mean and variance are proportional (in fact, they are equal). 

The square-root transformation is obtained through use of the equation $Y = \sqrt{X}$, where X 
is the original score and Y represents the transformed score. However, when any value of X is 
less than 10, any or all of the following equations are recommended by various sources in place 
of $Y = \sqrt{X}$, since they are more likely to result in homogeneous variances: a) $Y = \sqrt{X + .5}$; 
b) $Y = \sqrt{X} + \sqrt{X + 1}$; and c) $Y = \sqrt{X + .375}$. 


Logarithmic transformation A logarithmic transformation may be useful when the 
mean is proportional to the standard deviation (i.e., the proportion between the mean of a 
treatment and the standard deviation of a treatment is approximately the same for all of the 
treatments). Under such circumstances the logarithmic transformation can be effective in 
normalizing distributions that have a moderate positive skew, as well as making the treatment 
variances more homogeneous. Since a logarithmic transformation makes a more extreme adjust- 
ment than a square-root transformation, it can be employed to normalize distributions that have 
a more severe positive skew. On a logarithmic scale the distance between adjacent points on the 
scale will be less than the distance between the corresponding points on the original scale of 
measurement. As is the case with the square-root transformation, the logarithmic transformation 
is often useful for normalizing a dependent variable that is a measure of response time. 

The logarithmic transformation is obtained through use of the equation Y = log X. Since 
a logarithm cannot be computed for the value zero, when one or more zeros or positive numbers 
close to zero are present in a set of data, the following equation is employed: Y = log(X + 1). 
Since a logarithm cannot be computed for a negative number, a constant (the value of which is 
a positive number that is minimally greater than one unit above the absolute value of the lowest 
negative number) can be added to all of the values in a set of data, to ensure that each value will 
be a positive number. In employing a logarithmic transformation, it does not matter what base 
value is employed for the logarithm — some sources employ the base 10 while others use the 
base e = 2.71828 (see Endnote 5 in the Introduction for a clarification of what a base value of 
a logarithm represents). 
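
The point that the base of the logarithm does not matter can be checked numerically: changing the base rescales every transformed score by the same constant, so the ratio of the group variances is unchanged. The following is a minimal sketch; the helper name `log_transform`, its `add_one` flag, and the scores are hypothetical and chosen only for illustration.

```python
import math
from statistics import variance

def log_transform(scores, base=10.0, add_one=False):
    """Y = log X, or Y = log(X + 1) when add_one is True (for data containing zeros)."""
    shift = 1.0 if add_one else 0.0
    return [math.log(x + shift, base) for x in scores]

# Hypothetical scores for two treatments.
group_1 = [3, 5, 8, 13]
group_2 = [20, 35, 60, 110]

def variance_ratio(base):
    """Ratio of the larger to the smaller variance of the log-transformed groups."""
    v1 = variance(log_transform(group_1, base))
    v2 = variance(log_transform(group_2, base))
    return max(v1, v2) / min(v1, v2)

ratio_base_10 = variance_ratio(10.0)
ratio_base_e = variance_ratio(math.e)
print(ratio_base_10, ratio_base_e)   # the two ratios agree apart from floating-point error

# When zeros are present, Y = log(X + 1) keeps every logarithm defined.
shifted = log_transform([0, 4, 9], add_one=True)
```

If negative scores are present, a constant would have to be added to every score before calling the function, as described above.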


Reciprocal transformation A reciprocal transformation (also referred to as an inverse 
transformation) may be useful when the square of the mean is proportional to the standard 
deviation (i.e., the proportion between the square of the mean of a treatment and the standard 
deviation of a treatment is approximately the same for all of the treatments). Under such 
circumstances the reciprocal transformation can be effective in normalizing distributions that 
have a moderate positive skew, as well as making the treatment variances more homogeneous. 
Since it exerts the most extreme adjustment with regard to normality, the reciprocal 
transformation is often able to normalize data that the square-root and logarithmic 
transformations are unable to normalize. Tabachnick and Fidell (1996) recommend the 
reciprocal transformation for normalizing a J-shaped distribution (i.e., a distribution that looks 
like the letter J or its mirror image — specifically, an extremely skewed unimodal distribution 
that is peaked without a tail at one end, and with a tail falling off toward the other end). 

The reciprocal transformation is obtained through use of the equation Y = 1/X. If any of the 
scores are equal to zero, the equation Y = 1/(X + 1) should be employed. Additional comments 
on the reciprocal transformation can be found at the end of the discussion on data 
transformations. 

At this point some hypothetical data will be employed to demonstrate the square-root, 
logarithmic, and reciprocal transformations. To illustrate the application of the square-root 
transformation, assume we have two groups, with five subjects per group. The interval/ratio 
scores of the subjects in the two groups follow: Group 1: 2, 3, 4, 6, 10; Group 2: 10, 20, 20, 
25, 25. Employing Equations I.1, I.8, and I.5, the sample means, estimated population standard 
deviations, and estimated population variances are computed to be $\bar{X}_1 = 5$, $\tilde{s}_1 = 3.16$, $\tilde{s}_1^2 = 10$; 
$\bar{X}_2 = 20$, $\tilde{s}_2 = 6.12$, $\tilde{s}_2^2 = 37.5$. Note that in each group the estimated population variance is 
approximately two times as large as the group mean. Let us assume we wish to make the variances 
in the two groups more homogeneous, since the variance of Group 2 is almost four times 
as large as the variance of Group 1.”° Since the estimated population variances and means are 
proportional, we elect to employ a square-root transformation. Since some of the scores in Group 
1 are less than 10, we will employ the equation $Y = \sqrt{X + .5}$ to convert each score. The resulting 
corresponding transformed scores for the subjects in the two groups are: Group 1: 1.581, 1.871, 
2.121, 2.550, 3.240; Group 2: 3.240, 4.528, 4.528, 5.050, 5.050. Employing Equations I.1 and 
I.5 with the transformed scores, the sample means (employing the notation $\bar{Y}$) and estimated 
population variances are computed to be $\bar{Y}_1 = 2.27$, $\tilde{s}_1^2 = .42$; $\bar{Y}_2 = 4.48$, $\tilde{s}_2^2 = .55$. Note that 
the estimated population variances are now almost equal — specifically, the variance of Group 
2 is only 1.31 times larger than the variance of Group 1. When the logarithmic transformation 
is applied to the same set of data (using the equation Y = log X), it also results in more 
homogeneous variances, with the variance of Group 1 being 2.70 times larger than the variance 
of Group 2. On the other hand, when the reciprocal transformation is applied to the data, it 
increases heterogeneity of variance (with the reciprocal transformed variance of Group 1 being 
88.89 times larger than the reciprocal transformed variance of Group 2). It would appear that the 
square-root transformation is the most effective in equating the variances. 
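
The square-root computations for this first set of data can be reproduced with the short sketch below. It uses the `statistics` module from the Python standard library, whose `variance` function divides by n - 1 and therefore corresponds to the estimated population variance; the helper name `sqrt_half` is introduced here only for illustration.

```python
import math
from statistics import mean, variance

group_1 = [2, 3, 4, 6, 10]
group_2 = [10, 20, 20, 25, 25]

def sqrt_half(scores):
    """Y = sqrt(X + .5), the variant used because some scores are below 10."""
    return [math.sqrt(x + 0.5) for x in scores]

y1, y2 = sqrt_half(group_1), sqrt_half(group_2)

# Ratio of the Group 2 variance to the Group 1 variance, before and after.
raw_ratio = variance(group_2) / variance(group_1)
transformed_ratio = variance(y2) / variance(y1)

print(round(mean(y1), 2), round(variance(y1), 2))
print(round(mean(y2), 2), round(variance(y2), 2))
print(round(raw_ratio, 2), round(transformed_ratio, 2))
```

The transformed means and variances agree with the values reported above to two decimal places, and the variance ratio drops from 3.75 to approximately 1.31.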

To illustrate the application of the logarithmic transformation, assume we have two 
groups, with five subjects per group. The interval/ratio scores of the subjects in the two groups 
follow: Group 1: 12, 14, 16, 18, 20; Group 2: 28, 32, 40, 50, 50. Employing Equations I.1, 
I.8 and I.5, the sample means, estimated population standard deviations, and estimated population 
variances are computed to be $\bar{X}_1 = 16$, $\tilde{s}_1 = 3.16$, $\tilde{s}_1^2 = 10$; $\bar{X}_2 = 40$, $\tilde{s}_2 = 10.10$, $\tilde{s}_2^2 = 102$. 
Note that in each group the estimated population standard deviation is between one-fourth and 
one-fifth the size of the group mean. Let us assume we wish to make the variances in the two groups 
more homogeneous, since the variance of Group 2 is about ten times as large as the variance of 
Group 1. Since the estimated population standard deviations and means are proportional, we 
elect to employ a logarithmic transformation (using the equation Y = log X, which employs the 
base 10 for the logarithm). The resulting corresponding transformed scores for the subjects in 
the two groups are: Group 1: 1.079, 1.146, 1.204, 1.255, 1.301; Group 2: 1.447, 1.505, 1.602, 
1.699, 1.699. Employing Equations I.1 and I.5 with the transformed scores, the sample means 
(employing the notation $\bar{Y}$) and estimated population variances are computed to be $\bar{Y}_1 = 1.197$, 
$\tilde{s}_1^2 = .008$; $\bar{Y}_2 = 1.590$, $\tilde{s}_2^2 = .013$. Note that the ratio of the larger variance ($\tilde{s}_2^2 = .013$) to 
the smaller variance ($\tilde{s}_1^2 = .008$) is now only 1.68. When the square-root transformation is 
applied to the same set of data (using the equation $Y = \sqrt{X}$, since none of the scores is less than 
10) it also results in more homogeneous variances, with the variance of Group 2 being 4.13 times 
larger than the variance of Group 1. The reciprocal transformation applied to the same data 
also results in more homogeneous variances, with the variance of Group 1 being 3.47 times larger 
than the variance of Group 2. It would appear that although all three transformations make the 
variances more homogeneous, the logarithmic transformation is the most effective. 
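
The comparison of the three transformations for this second set of data can be sketched as follows. The helper name `variance_ratio` is hypothetical, and the ratio is always expressed as the larger variance divided by the smaller one. Because the sketch works from unrounded transformed scores, a ratio may differ slightly from a value in the text that appears to rest on rounded variances (the reciprocal ratio in particular).

```python
import math
from statistics import variance

group_1 = [12, 14, 16, 18, 20]
group_2 = [28, 32, 40, 50, 50]

transforms = {
    "square-root": math.sqrt,            # Y = sqrt(X); no score is below 10
    "logarithmic": math.log10,           # Y = log X, base 10
    "reciprocal": lambda x: 1.0 / x,     # Y = 1/X
}

def variance_ratio(a, b, f=lambda x: x):
    """Larger estimated population variance divided by the smaller one,
    after applying the transformation f to every score."""
    va = variance([f(x) for x in a])
    vb = variance([f(x) for x in b])
    return max(va, vb) / min(va, vb)

untransformed = variance_ratio(group_1, group_2)
ratios = {name: variance_ratio(group_1, group_2, f) for name, f in transforms.items()}
most_effective = min(ratios, key=ratios.get)

print(untransformed, ratios, most_effective)
```

All three ratios fall below the untransformed ratio of 10.2, and the logarithmic transformation yields the smallest one, in line with the conclusion reached above.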

To illustrate the application of the reciprocal transformation, assume we have two groups, 
with five subjects per group. The interval/ratio scores of the subjects in the two groups follow: 
Group 1: 2, 3, 4, 6, 10; Group 2: 1, 1, 3, 5, 90. Employing Equations I.1, I.8 and I.5, the 
sample means, estimated population standard deviations, and estimated population variances are 
computed to be $\bar{X}_1 = 5$, $\tilde{s}_1 = 3.16$, $\tilde{s}_1^2 = 10$; $\bar{X}_2 = 20$, $\tilde{s}_2 = 39.17$, $\tilde{s}_2^2 = 1534$. Note that in 
each group the square of the mean is between 8 and 10 times the size of the estimated population 
standard deviation. Let us assume we wish to make the variances in the two groups more 
homogeneous, since the variance of Group 2 is approximately 153 times as large as the variance 
of Group 1. Since the square of the means and the estimated standard deviations are proportional, 
we elect to employ a reciprocal transformation (using the equation Y = 1/X). The resulting 
corresponding transformed scores for the subjects in the two groups are: Group 1: .5, .333, .25, 
.167, .1; Group 2: 1, 1, .333, .2, .011. Employing Equations I.1 and I.5 with the transformed 
scores, the sample means and estimated population variances are computed to be $\bar{Y}_1 = .27$, 
$\tilde{s}_1^2 = .024$; $\bar{Y}_2 = .51$, $\tilde{s}_2^2 = .21$. Note that although the estimated population variances are still 
not equal, they are considerably closer than the values computed for the untransformed data; 
specifically, as a result of the transformation the variance of Group 2 is 8.92 times larger than the 
variance of Group 1. When the square-root transformation (using the equation $Y = \sqrt{X + .5}$, 
since some of the scores are less than 10) is applied to the same set of data, the variance of Group 
2 is 29.92 times larger than the variance of Group 1. When the logarithmic transformation is 
applied to the same set of data (using the equation Y = log X), the variance of Group 2 is 8.8 
times larger than the variance of Group 1. It would appear that although none of the three 
transformations is able to result in homogeneous group variances, the logarithmic and reciprocal 
transformations come closest to achieving that goal. In the case of this latter example, the score 
of 90 in Group 2 would appear to be a possible outlier, and, as such, might not be included in the 
data or have its impact altered in some way through use of some method of accommodation 
designed to reduce the impact of an outlier. It should also be noted that the degree of 
proportionality between the means, squared means, standard deviations, and/or variances is not 
the only factor that can be employed to determine what transformation will be most effective. 
Often, through use of trial and error, a researcher can investigate which, if any, transformation 
will best achieve the desired goal. 


Arcsine (arcsin) transformation An arcsine transformation (also referred to as an 
angular or inverse sine transformation) involves the use of a trigonometric function which is 
able to transform a proportion between 0 and 1 (or percentage between 0% and 100%) into 
an angle expressed in radians (1 radian = 57.3 degrees, which is equal to 180°/π, and one degree 
equals .01745 radians). The arcsine of a number is the angle whose sine is that number. 
Although some books contain tables of arcsine values, an arcsine can be computed on many 
calculators through use of the $\sin^{-1}$ key. An arcsine transformation may be useful for 
normalizing distributions when the means and variances are proportional, and the data 
are binomially distributed. Howell (1992) notes that although both the square-root and arcsine 
transformations are suitable when the means and variances are proportional, the square-root 
transformation compresses the upper tail of the distribution, whereas the arcsine transformation 
flattens the distribution by stretching out both tails. 

The arcsine transformation is obtained through use of the equation $Y = 2\arcsin\sqrt{X}$, 
where X will be a proportion between 0 and 1. Based on a paper by Bartlett (1947), sources 
recommend that the equation $Y = 2\arcsin\sqrt{X + (1/2n)}$ (or $Y = 2\arcsin\sqrt{X + (1/4n)}$) be 
employed when the value of X is equal to 0, and that the equation $Y = 2\arcsin\sqrt{X - (1/2n)}$ (or 
$Y = 2\arcsin\sqrt{X - (1/4n)}$) be employed when the value of X is equal to 1 (where n is the 
number of observations in the treatment for which a proportion is computed). All of the aforementioned 
equations yield a value of Y that is expressed in radians. To illustrate the arcsine 
transformation, assume we have the following five values which represent the proportion of 
bulbs that bloom in each of five flower beds: .25, .39, .5, .68, .75. When the latter values are 
employed in the equation $Y = 2\arcsin\sqrt{X}$, the following values (in radians) are computed for 
Y: 1.0472, 1.3490, 1.5708, 1.9391, 2.0944. The possible range of values that Y can equal, 
through use of any of the equations noted above, is 0 radians (for a proportion of zero) to 3.1416 
radians (for a proportion of 1) (which is equal to pi). 
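
The radian form of the transformation, together with the adjustment for proportions of exactly 0 or 1, can be sketched as follows. The function name `arcsine_radians` and its optional argument `n` are hypothetical; the sketch uses the 1/(2n) version of the adjustment and treats 1/2n as 1/(2n).

```python
import math

def arcsine_radians(x, n=None):
    """Y = 2 arcsin(sqrt(X)), in radians.

    If n (the number of observations behind the proportion) is supplied,
    a proportion of 0 is replaced by X + 1/(2n) and a proportion of 1 by
    X - 1/(2n) before the transformation is applied.
    """
    if n is not None:
        if x == 0:
            x = x + 1.0 / (2 * n)
        elif x == 1:
            x = x - 1.0 / (2 * n)
    return 2.0 * math.asin(math.sqrt(x))

proportions = [.25, .39, .5, .68, .75]
transformed = [arcsine_radians(p) for p in proportions]
print([round(y, 4) for y in transformed])

# Adjusted boundary values for a treatment with n = 10 observations.
lower = arcsine_radians(0, n=10)
upper = arcsine_radians(1, n=10)
```

Rounded to four decimal places, the five transformed values agree with those listed above, and the adjusted boundary values fall strictly between 0 and pi.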

Some sources (e.g., Myers and Well (1995), Rao (1998) and Zar (1999)) employ the 
following alternative arcsine transformation equation which expresses the value of Y in degrees: 
$Y = \arcsin\sqrt{X}$. At the conclusion of the data analysis, the transformed values can be converted 
back into the original proportions through use of the equation $X = (\sin Y)^2$. To illustrate the 
arcsine transformation using the alternative equation, assume we have the same five values 
representing the proportion of bulbs that bloom in each of five flower beds (i.e., .25, .39, .5, .68, 
.75). When the latter values are employed in the equation $Y = \arcsin\sqrt{X}$, the following values 
(in degrees) are computed for Y: 30, 38.65, 45, 55.55, 60. Zar (1999) notes additional alternative 
equations for computing the value of Y in degrees. The possible range of values that Y can 
equal, through use of any of the equations that compute an arcsine in degrees, is 0° (for a proportion 
of zero) to 90° (for a proportion of 1). Additional information on the arcsine 
transformation, as well as examples illustrating its application, can be found in Rao (1998) and 
Zar (1999). 
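
The degree form of the transformation and the conversion back to the original proportions can be sketched as follows; the function names `arcsine_degrees` and `back_transform` are hypothetical.

```python
import math

def arcsine_degrees(x):
    """Y = arcsin(sqrt(X)), expressed in degrees."""
    return math.degrees(math.asin(math.sqrt(x)))

def back_transform(y_degrees):
    """X = (sin Y)^2, where Y is an angle in degrees."""
    return math.sin(math.radians(y_degrees)) ** 2

proportions = [.25, .39, .5, .68, .75]
angles = [arcsine_degrees(p) for p in proportions]
recovered = [back_transform(y) for y in angles]

print([round(y, 2) for y in angles])
print([round(x, 2) for x in recovered])
```

Converting the angles back returns the original five proportions, apart from floating-point error.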


$Y = X^2$ transformation Zar (1999) notes that if there is an inverse relationship between 
the treatment standard deviations and the treatment means (i.e., as the standard deviations 
increase, the means decrease) and/or a distribution is negatively skewed, the following transformation 
may be useful for normalizing data and creating homogeneity of variance: $Y = X^2$. 


Final comments on data transformation Some data transformations, such as a reciprocal 
transformation, may result in a reversal in the direction of the scores. In other words, if we have 
two scores a and b with a > b, if we obtain the reciprocal of both, the reciprocal of b will be 
greater than the reciprocal of a. The process of reversing the direction to restore the original 
ordinal relationship between the scores is called reflection. Reflection can also be used to 
convert negatively skewed data into positively skewed data. Tabachnick and Fidell (1996) note 
the latter can be accomplished by doing the following: a) Create a constant that is larger than all 
of the scores in a distribution by adding the value 1 to the largest score in the distribution; 
b) Subtract each score in the distribution from the constant value. At this point the converted 
data will have a positive skew and the appropriate transformation for normalizing positively 
skewed data (e.g., square-root, logarithmic, and reciprocal transformations) can be employed. 
After the appropriate statistical test has been employed with the normalized data, the researcher 
must remember to take into account the reversal in the direction of scoring in interpreting the 
results. 
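
The two reflection steps described above can be sketched as follows. The function name `reflect` and the scores are hypothetical; the scores were chosen so that most values lie at the upper end of the scale, with a tail extending toward the low end.

```python
def reflect(scores):
    """Reflect a set of scores: a) form a constant equal to the largest
    score plus 1; b) subtract each score from that constant."""
    constant = max(scores) + 1
    return [constant - x for x in scores]

negatively_skewed = [1, 6, 8, 9, 9, 10, 10]   # hypothetical scores
reflected = reflect(negatively_skewed)
print(reflected)   # [10, 5, 3, 2, 2, 1, 1]
```

After reflection the smallest possible score is 1 and the tail extends toward the high end, so a transformation intended for positively skewed data can then be applied. A high reflected score corresponds to a low original score, which is the reversal that must be kept in mind when the results are interpreted.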

In the final analysis, as with anything else, a data transformation should be judged on the 
basis of its practical consequences. Specifically, if through use of a data transformation a sig- 
nificant result is obtained, and that result can be consistently replicated employing the same data 
transformation, a researcher can conclude that one is dealing with a reliable phenomenon. If the 
data obtained through use of a data transformation proves to be useful in a practical or theoretical 
sense, it is as valuable as data which when analyzed does not require any sort of transformation. 


4. Hotelling's $T^2$ The multivariate analog of the t test for two independent samples is 
Hotelling's $T^2$ (Hotelling (1931)), which is one of a number of multivariate statistical procedures 
discussed in the book. The term multivariate is employed in reference to procedures that evaluate 
experimental designs in which there are multiple independent variables and/or multiple 
dependent variables. Hotelling's $T^2$, which is a special case of the multivariate analysis of 
variance (MANOVA) (discussed in Section VII of the single-factor between-subjects analysis 
of variance), can be employed to analyze the data for an experiment that involves a single independent 
variable comprised of two levels and multiple dependent variables. With regard to the 
latter, instead of a single score, each subject produces scores on two or more dependent variables. 
To illustrate, let us assume that in Example 11.1 two scores are obtained for each subject. One 
score represents a subject's level of depression and a second score represents the subject's level 
of anxiety. Within the framework of Hotelling's $T^2$, a composite mean based on both the 
depression and anxiety scores of subjects is computed for each group. The latter composite 
means are referred to as mean vectors or centroids. As is the case with the t test for two independent 
samples, the means (in this case composite) for the two groups are then compared 
with one another. For a discussion of the advantages of employing multiple dependent variables 
in a study, the reader should refer to the discussion of the multivariate analysis of variance in 
Section VII of the single-factor between-subjects analysis of variance. Like most multivariate 
procedures, the mathematics involved in conducting Hotelling's $T^2$ is quite complex, and for this 
reason it becomes laborious if not impractical to implement without the aid of a computer. Since 
a full description of Hotelling's $T^2$ is beyond the scope of this book, the interested reader should 
consult sources such as Stevens (1986, 1996) and Tabachnick and Fidell (1989, 1996) which 
describe multivariate procedures in detail. 


VIII. Additional Examples Illustrating the Use of the t Test for 
Two Independent Samples 


Two additional examples that can be evaluated with the t test for two independent samples are 
presented in this section. Since Examples 11.4 and 11.5 employ the same data set as that employed 
in Example 11.1, they yield the identical result. 


Example 11.4 A researcher wants to assess the relative effect of two different kinds of punish- 
ment (loud noise versus a blast of cold air) on the emotionality of mice. Each of ten mice is 
randomly assigned to one of two groups. During the course of the experiment each mouse is 
sequestered in an experimental chamber. While in the chamber, each of the five mice in Group 
1 is periodically presented with a loud noise, and each of the five mice in Group 2 is periodically 
presented with a blast of cold air. The presentation of the punitive stimulus for each of the 
animals is generated by a machine that randomly presents the stimulus throughout the duration 
of the time the mouse is in the chamber. The dependent variable of emotionality employed in the 
study is the number of times each mouse defecates while in the experimental chamber. The 
number of episodes of defecation for the 10 mice follow: Group 1: 11, 1, 0, 2, 0; Group 2: 11, 
11, 5, 8, 4. Do subjects exhibit differences in emotionality under the different experimental 
conditions? 


In Example 11.4, if the one-tailed alternative hypothesis $H_1: \mu_1 < \mu_2$ is employed it can 
be concluded that the group presented with the blast of cold air (Group 2) obtains a significantly 
higher emotionality score than the group presented with loud noise (Group 1). This is the case, 
since the computed value t = −1.96 indicates that the average defecation score of Group 2 is 
significantly higher than the average defecation score of Group 1. As is the case in Example 
11.1, the nondirectional alternative hypothesis $H_1: \mu_1 \neq \mu_2$ is not supported, and thus, if the 
latter alternative hypothesis is employed one cannot conclude that the blast of cold air results in 
higher emotionality. 
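
The value t = −1.96 can be checked with the sketch below, which uses the familiar pooled-variance form of the t test for two independent samples (with equal sample sizes the unpooled form gives the same value). The function name `independent_t` is hypothetical, and the two critical values are the tabled .05 values of t for df = 8, entered by hand rather than computed.

```python
import math
from statistics import mean, variance

group_1 = [11, 1, 0, 2, 0]    # loud noise
group_2 = [11, 11, 5, 8, 4]   # blast of cold air

def independent_t(a, b):
    """Return (t, df) for two independent samples, pooling the two
    estimated population variances."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / standard_error, n1 + n2 - 2

t, df = independent_t(group_1, group_2)

CRITICAL_ONE_TAILED_05 = 1.86   # tabled value, df = 8
CRITICAL_TWO_TAILED_05 = 2.31   # tabled value, df = 8

# Directional hypothesis (mean of Group 1 below mean of Group 2): t must be
# negative and its absolute value must reach the one-tailed critical value.
one_tailed_supported = t < 0 and abs(t) >= CRITICAL_ONE_TAILED_05
two_tailed_supported = abs(t) >= CRITICAL_TWO_TAILED_05

print(round(t, 2), df, one_tailed_supported, two_tailed_supported)
```

The absolute value of t exceeds the one-tailed critical value but not the two-tailed one, which is why only the directional alternative hypothesis is supported.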


Example 11.5 Each of two companies that manufacture the same size precision ball bearing 
claims it has better quality control than its competitor. A quality control engineer conducts a 
study in which he compares the precision of ball bearings manufactured by the two companies. 
The engineer randomly selects five ball bearings from the stock of Company A and five ball 
bearings from the stock of Company B. He measures how much the diameter of each of the ten 
ball bearings deviates from the manufacturer's specifications. The deviation scores (in 
micrometers) for the ten ball bearings manufactured by the two companies follow: Company 
A: 11, 1, 0, 2, 0; Company B: 11, 11, 5, 8, 4. What can the engineer conclude about the 
relative quality control of the two companies? 


In Example 11.5, if the one-tailed alternative hypothesis $H_1: \mu_1 < \mu_2$ is employed it can 
be concluded that Company B obtains a significantly higher deviation score than Company A. 
This will allow the researcher to conclude that Company A has superior quality control. As is 
the case in Example 11.1, the nondirectional alternative hypothesis $H_1: \mu_1 \neq \mu_2$ is not 
supported, and thus, if the latter alternative hypothesis is employed, the researcher cannot 
conclude that Company A has superior quality control. 
Endnotes 


1. Alternative terms that are commonly used to describe the different samples employed in an 
experiment are groups, experimental conditions, and experimental treatments. 


2. It should be noted that there is a design in which different subjects serve in each of the k 
experimental conditions that is categorized as a dependent samples design. In a dependent 
samples design each subject either serves in all of the k experimental conditions, or 
else is matched with a subject in each of the other (k − 1) experimental conditions. When 
subjects are matched with one another they are equated on one or more variables that are 
believed to be correlated with scores on the dependent variable. The concept of matching 
and a general discussion of the dependent samples design can be found under the t test for 
two dependent samples (Test 17). 


3. An alternative but equivalent way of writing the null hypothesis is $H_0: \mu_1 - \mu_2 = 0$. The 
analogous alternative but equivalent ways of writing the alternative hypotheses in the order 
they are presented are: $H_1: \mu_1 - \mu_2 \neq 0$, $H_1: \mu_1 - \mu_2 > 0$, $H_1: \mu_1 - \mu_2 < 0$. 


4. In order to be solvable, an equation for computing the t statistic requires that there is 
variability in the scores of at least one of the two groups. If all subjects in Group 1 have 
the same score and all subjects in Group 2 have the same score, the values computed for 
the estimated population variances will equal zero (i.e., $\tilde{s}_1^2 = \tilde{s}_2^2 = 0$). If the latter is true 
the denominator of any of the equations to be presented for computing the value of t will 
equal zero, thus rendering a solution impossible. 


5. When $n_1 = n_2$, $\tilde{s}_p = \sqrt{(\tilde{s}_1^2 + \tilde{s}_2^2)/2}$. 


6. The actual value that is estimated by $s_{\bar{X}_1 - \bar{X}_2}$ is $\sigma_{\bar{X}_1 - \bar{X}_2}$, which is the standard deviation of 
the sampling distribution of the difference scores for the two populations. The meaning of 
the standard error of the difference can be best understood by considering the following 
procedure for generating an empirical sampling distribution of difference scores: a) Obtain 
a random sample of $n_1$ scores from Population 1 and a random sample of $n_2$ scores from 
Population 2; b) Compute the mean of each sample; c) Obtain a difference score by 
subtracting the mean of Sample 2 from the mean of Sample 1, i.e., $\bar{X}_1 - \bar{X}_2 = D$; and 
d) Repeat steps a) through c) m times. At the conclusion of this procedure one will have 
obtained m difference scores. The standard error of the difference represents the 
standard deviation of the m difference scores, and can be computed by using Equation 
I.8/2.1. Thus: $s_{\bar{X}_1 - \bar{X}_2} = \sqrt{[\Sigma D^2 - ((\Sigma D)^2/m)]/[m - 1]}$. The standard deviation that is 
computed with the aforementioned equation is an estimate of $\sigma_{\bar{X}_1 - \bar{X}_2}$. 
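
The procedure for generating an empirical sampling distribution of difference scores can be sketched as a small simulation. The function name `empirical_standard_error`, the use of normally distributed populations, and all parameter values are assumptions made only for illustration; the procedure itself does not require normal populations.

```python
import math
import random
from statistics import stdev

def empirical_standard_error(n1, n2, m, mu1=0.0, mu2=0.0,
                             sigma1=1.0, sigma2=1.0, seed=0):
    """Standard deviation of m simulated difference scores D."""
    rng = random.Random(seed)
    differences = []
    for _ in range(m):                                           # step d)
        sample_1 = [rng.gauss(mu1, sigma1) for _ in range(n1)]   # step a)
        sample_2 = [rng.gauss(mu2, sigma2) for _ in range(n2)]
        mean_1 = sum(sample_1) / n1                              # step b)
        mean_2 = sum(sample_2) / n2
        differences.append(mean_1 - mean_2)                      # step c)
    # stdev divides by m - 1, the usual estimated population standard deviation
    return stdev(differences)

estimate = empirical_standard_error(n1=5, n2=5, m=5000)

# For these assumed populations the standard error of the difference is
# sqrt(sigma1^2/n1 + sigma2^2/n2).
theoretical = math.sqrt(1.0 / 5 + 1.0 / 5)
print(estimate, theoretical)
```

As m increases, the standard deviation of the simulated difference scores settles near the theoretical value for the assumed populations.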


7. Equation 11.4 can also be written in the form $df = (n_1 - 1) + (n_2 - 1)$, which reflects 
the number of degrees of freedom for each of the groups. 


8. The absolute value of t is employed to represent t in the summary statement. 


9. The $F_{\max}$ test is one of a number of statistical procedures that are named after the English 
statistician Sir Ronald Fisher. Among Fisher's contributions to the field of statistics was 
the development of a sampling distribution referred to as the F distribution (which bears 
the first letter of his surname). The values in the $F_{\max}$ distribution are derived from the F 
distribution. 

10. A tabled $F_{.975}$ value is the value below which 97.5% of the F distribution falls and above 
which 2.5% of the distribution falls. A tabled $F_{.995}$ value is the value below which 99.5% 
of the F distribution falls and above which .5% of the distribution falls. 

11. In Table A9 the value $F_{\max} = 23.2$ is the result of rounding off $F_{.995} = 23.15$. 


12. A tabled $F_{.025}$ value is the value below which 2.5% of the F distribution falls and above 
which 97.5% of the distribution falls. A tabled $F_{.005}$ value is the value below which .5% 
of the F distribution falls and above which 99.5% of the distribution falls. 


13. Most sources only list values in the upper tail of the F distribution. The values $F_{.05} = .157$ 
and $F_{.01} = .063$ are obtained from Guenther (1965). It so happens that when 
$df_{num} = df_{den}$, the value of $F_{.05}$ can be obtained by dividing 1 by the value of $F_{.95}$. Thus: 
1/6.39 = .157. In the same respect the value of $F_{.01}$ can be obtained by dividing 1 by the 
value of $F_{.99}$. Thus: 1/15.98 = .063. 


14. When $n = n_1 = n_2$ and $t_1 = t_2 = t$, the $t'$ value computed with Equation 11.9 will equal 
the tabled critical t value for $df = n - 1$. When $n_1 \neq n_2$, the computed value of $t'$ will fall 
in between the values of $t_1$ and $t_2$. It should be noted that the effect of violation of the 
homogeneity of variance assumption on the t test statistic decreases as the number of subjects 
employed in each of the samples increases. This can be demonstrated in relation to 
Equation 11.9, in that if there are a large number of subjects in each group the value that 
is employed for both $t_1$ and $t_2$ in Equation 11.9 is $t_{.05} = 1.96$. The latter tabled critical 
two-tailed .05 value, which is also the tabled critical two-tailed .05 value for the normal distribution, 
is the value that is computed for $t'$. Thus, in the case of large sample sizes the 
tabled critical value for $df = n_1 + n_2 - 2$ will be equivalent to the value computed for 
$df = n_1 - 1$ and $df = n_2 - 1$. 


15. The treatment effect described in this section is not the same thing as Cohen's d index 
(the effect size computed with Equation 11.10). However, if a hypothesized effect size is 
present in a set of data, the computed value of d can be used as a measure of treatment 
effect. In such an instance, the value of d will be positively correlated with the value of the 
treatment effect described in this section. Cohen (1988, pp. 24—27) describes how the d 
index can be converted into the type of correlational treatment effect measure that is 
discussed in this section. Endnote 18 discusses the relationship between the d index and 
the omega squared statistic presented in this section in more detail. 


16. It should be noted, however, that the degree of error associated with a measure of treatment 
effect will decrease as the size of the sample employed to compute the measure increases. 


17. The reader familiar with the concept of correlation can think of a measure of treatment 
effect as a correlational measure which provides information analogous to that provided by 
the coefficient of determination (designated by the notation $r^2$), which is the square of 
the Pearson product-moment correlation coefficient. The coefficient of determination 
(which is discussed in more detail in Section V of the Pearson product-moment correlation 
coefficient) measures the degree of variability on one variable that can be accounted 
for by variability on a second variable. This latter definition is consistent with the 
definition that is provided in this section for a treatment effect. 


18. a) In actuality, Cohen (1977, 1988) employs the notation for eta squared (which is discussed 
briefly in the next paragraph and in greater detail in Section VI of the single-factor 
between-subjects analysis of variance) in reference to the aforementioned effect size 
values. Endnote 58 in the single-factor between-subjects analysis of variance clarifies 
Cohen's (1977, 1988) use of eta squared and omega squared to represent the same 
measure; b) Cohen (1977, 1988, pp. 23-27) states that the small, medium, and large effect 
size values of .0099, .0588, and .1379 are equivalent to the values .2, .5, and .8 for his d 
index (which was discussed previously in the section on statistical power). In point of fact, 
the values .2, .5, and .8 represent the minimum values for a small, medium, and large effect 
size for Cohen's d index. The conversion of an omega squared/eta squared value into 
the corresponding Cohen's d index value is described in Section IX (the Addendum) of 
the Pearson product-moment correlation coefficient under the discussion of meta-analysis 
and related topics. 


19. This result can also be written as: $-.89 < (\mu_2 - \mu_1) < 10.89$. 


20. In instances where, in stating the null hypothesis, a researcher stipulates that the difference 
between the two population means is some value other than zero, the numerator of Equation 
11.16 is the same as the numerator of Equation 11.5. The protocol for computing the value 
of the numerator is identical to that employed for Equation 11.5. 


21. The general issues discussed in this section are relevant to any case in which a parametric 
and nonparametric test can be employed to evaluate the same set of data. 


22. Barnett and Lewis (1994) note that the presence of an outlier may not always be obvious 
as a result of visual inspection of data. Typically, the more complex the structure of data, 
the more difficult it becomes to visually detect outliers. Regression analysis and 
multivariate analysis are cited as examples of data analysis where visual detection of 
outliers is often difficult. 


23. If, as a result of the presence of one or more outliers, the difference between the group 
means is also inflated, the use of a more conservative test will, in part, compensate for this 
latter effect. The impact of outliers on the t test for two independent samples is 
discussed by Zimmerman and Zumbo (1993). The latter authors note that the presence of 
outliers in a sample may decrease the power of the t test to such a degree that the Mann-Whitney 
U test (which is the rank-order nonparametric analog of the t test for two 
independent samples) will be a more powerful test for comparing two independent 
samples. 


Barnett and Lewis (1994, p. 84) note that the use of the median absolute deviation as a 
measure of dispersion/variability can be traced back to the 19th century to the great German 
mathematician, Johann Karl Friedrich Gauss. Barnett and Lewis (1994, p. 156) state that 
although the median absolute deviation is a less efficient measure of dispersion than the 
standard deviation, it is a more robust estimator (especially for nonnormally distributed 
data). 
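
As a rough illustration of the robustness noted above, the following Python sketch computes the median absolute deviation under its usual definition (the median of the absolute deviations of the scores from their median) alongside the standard deviation, with and without a single extreme score. The definition as coded, the function name, and the scores are assumptions introduced here for illustration; they are not taken from Barnett and Lewis (1994).

```python
import statistics


def median_absolute_deviation(scores):
    """Median of the absolute deviations of the scores from their median."""
    center = statistics.median(scores)
    return statistics.median([abs(x - center) for x in scores])


# Hypothetical scores; the value 11 plays the role of an outlier.
without_outlier = [0, 0, 1, 2]
with_outlier = [0, 0, 1, 2, 11]

mad_without = median_absolute_deviation(without_outlier)
mad_with = median_absolute_deviation(with_outlier)
sd_without = statistics.stdev(without_outlier)
sd_with = statistics.stdev(with_outlier)

print(mad_without, mad_with)                     # median absolute deviation
print(round(sd_without, 2), round(sd_with, 2))   # standard deviation
```

In this sketch the single extreme score increases the standard deviation almost fivefold, whereas the median absolute deviation changes only from .5 to 1.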


Samples in which data have been deleted or modified are sometimes referred to as censored
samples (Barnett and Lewis, 1994, p. 78). The term censoring, however, is most
commonly employed in reference to studies where scores are not available for some of the 
subjects, since it is either not desirable or possible to follow each subject until the con- 
clusion of a study. This latter type of censored data is most commonly encountered in 
medical research when subjects no longer make themselves available for study, or a 
researcher is unable to locate subjects beyond a certain period of time. Good (1994, p. 117) 
notes that another example of censoring occurs when, within the framework of evaluating
a variable, the measurement breaks down at some point on the measurement continuum 
(usually at an extreme point). Consequently, one must employ approximate scores instead 
of exact scores to represent the observations that cannot be measured with precision. Two 
obvious options that can be employed to negate the potential impact of censored data are: 
a) Use of the median in lieu of the mean as a measure of central tendency; and b) 
Employing an inferential statistical test that uses rank-orders instead of interval/ratio scores. 


Among the sources that describe ways of dealing with censored data are Good (1994), 
Hollander and Wolfe (1999), Pagano and Gauvreau (1993), and Rosner (1995). The latter 
three references all discuss the Kaplan-Meier method/product-limit estimator (1958), 
which is a procedure that deals with censored data in estimating survival probabilities 
within the framework of medical research. Sprent (1993) also discusses censored data 
within the context of describing the Gehan-Wilcoxon test for censored data (developed 
by Gehan (1965a, 1965b)), a procedure for evaluating censored data in a design involving
two independent samples. 


In this example, as well as other examples in this section, use of the $F_{max}$ test may not yield
a significant result (i.e., it may not result in the conclusion that the population variances are
heterogeneous). The intent of the examples, however, is only to illustrate the variance 
stabilizing properties of the transformation methods. 


If the relationship 1 radian = 57.3 degrees is applied for a specific proportion, the number
of degrees computed with the equation $Y = \arcsin\sqrt{X}$ will not correspond to the number
of radians computed with the equation $Y = 2\arcsin\sqrt{X}$. Nevertheless, if the transformed
data derived from the two equations are evaluated with the same inferential statistical test,
the same result is obtained. In point of fact, if the equation $Y = \arcsin\sqrt{X}$ is employed to
derive the value of Y in radians, and the resulting value is multiplied by 57.3, it will yield
the same number of degrees obtained when that equation is used to derive the value of Y
in degrees. Since the multiplication of $\arcsin\sqrt{X}$ by 2 in the equation $Y = 2\arcsin\sqrt{X}$
does not alter the value of the ratio for the difference between means versus pooled
variability (or other relevant parameters being estimated within the framework of a
statistical test), it yields the same test statistic regardless of which equation is employed.
The author is indebted to Jerrold Zar for clarifying the relationship between the arcsine 
equations. 
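
The point made in this endnote can be checked numerically. The sketch below is only an illustration under stated assumptions: the two sets of proportions are hypothetical, the helper names are introduced here, and the pooled-variance t statistic is coded directly from its usual definition rather than taken from a specific equation in the text. Python's `math.degrees` uses the exact factor 180/pi rather than the rounded value 57.3.

```python
import math
import statistics


def pooled_t(sample_1, sample_2):
    """t statistic for two independent samples, using the pooled variance."""
    n1, n2 = len(sample_1), len(sample_2)
    pooled_variance = (
        (n1 - 1) * statistics.variance(sample_1)
        + (n2 - 1) * statistics.variance(sample_2)
    ) / (n1 + n2 - 2)
    difference = statistics.mean(sample_1) - statistics.mean(sample_2)
    return difference / math.sqrt(pooled_variance * (1 / n1 + 1 / n2))


def arcsine_in_degrees(proportion):
    """Y = arcsin(sqrt(X)), expressed in degrees."""
    return math.degrees(math.asin(math.sqrt(proportion)))


def double_arcsine_in_radians(proportion):
    """Y = 2 arcsin(sqrt(X)), expressed in radians."""
    return 2 * math.asin(math.sqrt(proportion))


# Hypothetical proportions for two independent groups.
group_a = [0.10, 0.25, 0.30, 0.45]
group_b = [0.50, 0.55, 0.70, 0.90]

t_degrees = pooled_t([arcsine_in_degrees(p) for p in group_a],
                     [arcsine_in_degrees(p) for p in group_b])
t_radians = pooled_t([double_arcsine_in_radians(p) for p in group_a],
                     [double_arcsine_in_radians(p) for p in group_b])

print(round(t_degrees, 4), round(t_radians, 4))  # the two values agree
```

Because the two transformations differ only by a constant multiplier, the difference between means and the pooled standard error are rescaled by the same factor, so the ratio is unchanged.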
Test 12


Mann-Whitney U Test 
(Nonparametric Test Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Do two independent samples represent two populations with 
different median values (or different distributions with respect to the rank-orderings of the scores 
in the two underlying population distributions)? 


Relevant background information on test The Mann-Whitney U test is employed with 
ordinal (rank-order) data in a hypothesis testing situation involving a design with two 
independent samples. If the result of the Mann-Whitney U test is significant, it indicates there 
is a significant difference between the two sample medians, and as a result of the latter the 
researcher can conclude there is a high likelihood that the samples represent populations with 
different median values. 

Two versions of the test to be described under the label of the Mann-Whitney U test were 
independently developed by Mann and Whitney (1947) and Wilcoxon (1949). The version to be 
described here is commonly identified as the Mann-Whitney U test, while the version developed 
by Wilcoxon (1949) is usually referred to as the Wilcoxon-Mann-Whitney test.¹ Although they
employ different equations and different tables, the two versions of the test yield comparable 
results. In employing the Mann-Whitney U test, one of the following is true with regard to the 
rank-order data that are evaluated: a) The data are in a rank-order format, since it is the only 
format in which scores are available; or b) The data have been transformed into a rank-order 
format from an interval/ratio format, since the researcher has reason to believe that the normality 
assumption (as well as, perhaps, the homogeneity of variance assumption) of the t test for two
independent samples (Test 11) (which is the parametric analog of the Mann-Whitney U test) 
is saliently violated. It should be noted that when a researcher elects to transform a set of 
interval/ratio data into ranks, information is sacrificed. This latter fact accounts for why there is 
reluctance among some researchers to employ nonparametric tests such as the Mann-Whitney 
U test, even if there is reason to believe that one or more of the assumptions of the t test for two
independent samples have been violated. 

Various sources (e.g. Conover (1980, 1999), Daniel (1990), and Marascuilo and 
McSweeney (1977)) note that the Mann-Whitney U test is based on the following assumptions: 
a) Each sample has been randomly selected from the population it represents; b) The two samples 
are independent of one another; c) The original variable observed (which is subsequently ranked) 
is a continuous random variable. In truth, this assumption which is common to many nonpar- 
ametric tests, is often not adhered to, in that such tests are often employed with a dependent 
variable that represents a discrete random variable; and d) The underlying distributions from 
which the samples are derived are identical in shape. The shapes of the underlying population 
distributions, however, do not have to be normal. Maxwell and Delaney (1990) point out the 
assumption of identically shaped distributions implies equal dispersion of data within each 
distribution. Because of this, they note that like the t test for two independent samples, the
Mann-Whitney U test also assumes homogeneity of variance with respect to the underlying
population distributions. Because the latter assumption is not generally acknowledged for the
Mann-Whitney U test, it is not uncommon for sources to state that violation of the homogeneity
of variance assumption justifies use of the Mann-Whitney U test in lieu of the t test for two
independent samples. It should be pointed out, however, that there is some empirical evidence
which suggests that the sampling distribution for the Mann-Whitney U test is not as affected
by violation of the homogeneity of variance assumption as is the sampling distribution for the
t test for two independent samples. One reason cited by various sources for employing the
Mann-Whitney U test is that by virtue of ranking interval/ratio data a researcher will be able to
reduce or eliminate the impact of outliers. As noted in Section VII of the t test for two independent
samples, since outliers can dramatically influence variability, they can be responsible for hetero- 
geneity of variance between two or more samples. In addition, outliers can have a dramatic 
impact on the value of a sample mean. 


II. Example 


Example 12.1 is identical to Example 11.1 (which is evaluated with the t test for two independent
samples). In evaluating Example 12.1 it will be assumed that the interval/ratio data are
rank-ordered, since one or more of the assumptions of the t test for two independent samples
have been saliently violated.


Example 12.1 In order to assess the efficacy of a new antidepressant drug, ten clinically
depressed patients are randomly assigned to one of two groups. Five patients are assigned to 
Group 1, which is administered the antidepressant drug for a period of six months. The other 
five patients are assigned to Group 2, which is administered a placebo during the same six-month 
period. Assume that prior to introducing the experimental treatments, the experimenter 
confirmed that the level of depression in the two groups was equal. After six months elapse all 
ten subjects are rated by a psychiatrist (who is blind with respect to a subject's experimental 
condition) on their level of depression. The psychiatrist's depression ratings for the five subjects 
in each group follow (the higher the rating, the more depressed a subject): Group 1: 11, 1, 0, 
2, 0; Group 2: 11, 11, 5, 8, 4. Do the data indicate that the antidepressant drug is effective?


III. Null versus Alternative Hypotheses 


Null hypothesis $H_0: \theta_1 = \theta_2$


(The median of the population Group 1 represents equals the median of the population Group 2
represents. With respect to the sample data, when both groups have an equal sample size, this
translates into the sum of the ranks of Group 1 being equal to the sum of the ranks of Group 2
(i.e., $\Sigma R_1 = \Sigma R_2$). A more general way of stating this, which also encompasses designs
involving unequal sample sizes, is that the means of the ranks of the two groups are equal (i.e.,
$\bar{R}_1 = \bar{R}_2$).)


Alternative hypothesis $H_1: \theta_1 \neq \theta_2$


(The median of the population Group 1 represents does not equal the median of the population
Group 2 represents. With respect to the sample data, when both groups have an equal sample
size, this translates into the sum of the ranks of Group 1 not being equal to the sum of the ranks
of Group 2 (i.e., $\Sigma R_1 \neq \Sigma R_2$). A more general way of stating this, which also encompasses
designs involving unequal sample sizes, is that the means of the ranks of the two groups are not
equal (i.e., $\bar{R}_1 \neq \bar{R}_2$). This is a nondirectional alternative hypothesis and it is evaluated with
a two-tailed test.)


or 
$H_1: \theta_1 > \theta_2$


(The median of the population Group 1 represents is greater than the median of the population
Group 2 represents. With respect to the sample data, when both groups have an equal sample size
(so long as a rank of 1 is given to the lowest score), this translates into the sum of the ranks of
Group 1 being greater than the sum of the ranks of Group 2 (i.e., $\Sigma R_1 > \Sigma R_2$). A more general
way of stating this (which also encompasses designs involving unequal sample sizes) is that the
mean of the ranks of Group 1 is greater than the mean of the ranks of Group 2 (i.e., $\bar{R}_1 > \bar{R}_2$).
This is a directional alternative hypothesis and it is evaluated with a one-tailed test.)


or 
$H_1: \theta_1 < \theta_2$


(The median of the population Group 1 represents is less than the median of the population
Group 2 represents. With respect to the sample data, when both groups have an equal sample
size (so long as a rank of 1 is given to the lowest score), this translates into the sum of the ranks
of Group 1 being less than the sum of the ranks of Group 2 (i.e., $\Sigma R_1 < \Sigma R_2$). A more general
way of stating this (which also encompasses designs involving unequal sample sizes) is that the
mean of the ranks of Group 1 is less than the mean of the ranks of Group 2 (i.e., $\bar{R}_1 < \bar{R}_2$). This
is a directional alternative hypothesis and it is evaluated with a one-tailed test.)


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 


The data for Example 12.1 are summarized in Table 12.1. The total number of subjects
employed in the experiment is N = 10. There are $n_1 = 5$ subjects in Group 1 and $n_2 = 5$
subjects in Group 2. The original interval/ratio scores of the subjects are recorded in the columns
labelled $X_1$ and $X_2$. The adjacent columns $R_1$ and $R_2$ contain the rank-order assigned to each
of the scores. The rankings for Example 12.1 are summarized in Table 12.2. The ranking
protocol for the Mann-Whitney U test is described in this section. Note that in Table 12.1 and
Table 12.2 each subject's identification number indicates the order in Table 12.1 in which a
subject's score appears in a given group, followed by his/her group. Thus, Subject i,j is the ith
subject in Group j.

The following protocol, which is summarized in Table 12.2, is used in assigning ranks. 

a) All N = 10 scores are arranged in order of magnitude (irrespective of group membership),
beginning on the left with the lowest score and moving to the right as scores increase. This is 
done in the second row of Table 12.2. 

b) In the third row of Table 12.2, all N = 10 scores are assigned a rank. Moving from left 
to right, a rank of 1 is assigned to the score that is furthest to the left (which is the lowest score), 
a rank of 2 is assigned to the score that is second from the left (which, if there are no ties, will 
be the second lowest score), and so on until the score at the extreme right (which will be the 
highest score) is assigned a rank equal to N (if there are no ties for the highest score). 

c) The ranks in the third row of Table 12.2 must be adjusted when there are tied scores
present in the data. Specifically, in instances where two or more subjects have the same score,
the average of the ranks involved is assigned to all scores tied for a given rank. This adjustment
is made in the fourth row of Table 12.2. To illustrate: Both Subjects 3,1 and 5,1 have a score
of 0. Since the two scores of 0 are the lowest scores out of the total of ten scores, in assigning
ranks to these scores we can arbitrarily assign one of the 0 scores a rank of 1 and the other a rank
of 2. However, since both of these scores are identical it is more equitable to give each of them
the same rank. To do this, we compute the average of the ranks involved for the two scores.
Thus, the two ranks involved prior to adjusting for ties (i.e., the ranks 1 and 2) are added up and
divided by two. The resulting value (1 + 2)/2 = 1.5 is the rank assigned to each of the subjects
who is tied for 0. There is one other set of ties present in the data which involves three subjects.
Subjects 1,1, 1,2, and 2,2 all obtain a score of 11. Since the ranks assigned to these three scores
prior to adjusting for ties are 8, 9, and 10, the average of the three ranks (8 + 9 + 10)/3 = 9 is
assigned to the scores of each of the three subjects who obtain a score of 11.


Table 12.1 Data for Example 12.1

               Group 1                              Group 2
               $X_1$   $R_1$                        $X_2$   $R_2$
Subject 1,1     11      9           Subject 1,2      11      9
Subject 2,1      1      3           Subject 2,2      11      9
Subject 3,1      0      1.5         Subject 3,2       5      6
Subject 4,1      2      4           Subject 4,2       8      7
Subject 5,1      0      1.5         Subject 5,2       4      5

$\Sigma R_1 = 19$                                   $\Sigma R_2 = 36$
$\bar{R}_1 = \Sigma R_1 / n_1 = 19/5 = 3.8$         $\bar{R}_2 = \Sigma R_2 / n_2 = 36/5 = 7.2$


Table 12.2 Rankings for the Mann-Whitney U Test for Example 12.1

Subject identification number   3,1   5,1   2,1   4,1   5,2   3,2   4,2   1,1   1,2   2,2
Depression score                  0     0     1     2     4     5     8    11    11    11
Rank prior to tie adjustment      1     2     3     4     5     6     7     8     9    10
Tie-adjusted rank               1.5   1.5     3     4     5     6     7     9     9     9
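
The ranking protocol described in steps a) through c) can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation; the function name `tie_adjusted_ranks` and the variable names are introduced here and do not appear in the text. The scores are those of Example 12.1.

```python
def tie_adjusted_ranks(scores):
    """Rank scores from lowest (rank 1) to highest, averaging the ranks of ties."""
    # Step a): positions of the scores arranged in order of magnitude.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        # Find the end of the block of scores tied with the score at `start`.
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        # Steps b) and c): the block spans ranks start+1 .. end+1; assign their average.
        average_rank = (start + 1 + end + 1) / 2
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


group_1 = [11, 1, 0, 2, 0]
group_2 = [11, 11, 5, 8, 4]

ranks = tie_adjusted_ranks(group_1 + group_2)  # rank all N = 10 scores together
ranks_1, ranks_2 = ranks[:len(group_1)], ranks[len(group_1):]

print(ranks_1, sum(ranks_1))  # ranks and sum of ranks for Group 1
print(ranks_2, sum(ranks_2))  # ranks and sum of ranks for Group 2
```

The two sums of ranks produced by the sketch are 19 and 36, the values recorded in Table 12.1.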

Although it is not the case in Example 12.1, it should be noted that any time each set of ties 
involves subjects in the same group, the tie adjustment will result in the identical sum and 
average for the ranks of the two groups that will be obtained if the tie adjustment is not 
employed. Because of this, under these conditions the computed test statistic will be identical 
regardless of whether or not one uses the tie adjustment. On the other hand, when one or more 
sets of ties involve subjects from both groups, the tie-adjusted ranks will yield a value for the test 
statistic that will be different from that which will be obtained if the tie adjustment is not 
employed. In Example 12.1, although the two subjects who obtain a score of zero happen to be 
in the same group, in the case of the three subjects who have a score of 11, one subject is in 
Group 1 and the other two subjects are in Group 2. 

If the ranking protocol described in this section is used with Example 12.1, and the researcher
elects to employ a one-tailed alternative hypothesis, the directional alternative
hypothesis $H_1: \theta_1 < \theta_2$ is employed. The latter directional alternative hypothesis is employed,
since it predicts that Group 1, the group that receives the antidepressant, will have a lower
median score, and thus a lower sum of ranks/average rank (both of which are indicative of a
lower level of depression) than Group 2.

It should be noted that it is permissible to reverse the ranking protocol described in this
section. Specifically, one can assign a rank of 1 to the highest score, a rank of 2 to the second 
highest score, and so on, until reaching the lowest score which is assigned a rank equal to the 
value of N. Although this reverse ranking protocol will yield the identical Mann-Whitney test 
statistic as the ranking protocol described in this section, it will result in ranks that are the oppo- 
site of those obtained in Table 12.2. If the protocol employed in ranking is taken into account in 
interpreting the results of the Mann-Whitney U test, both ranking protocols will lead to identical 
conclusions. Since it is less likely to cause confusion in interpreting the test statistic, it is
recommended that the original ranking protocol described in this section be employed (i.e., assigning
a rank of 1 to the lowest score and a rank equivalent to the value of N to the highest score). In view
of this, in all future discussion of the Mann-Whitney U test, as well as other tests that involve 
rank-ordering data, it will be assumed (unless otherwise stipulated) that the ranking protocol 
employed assigns a rank of 1 to the lowest score and a rank of N to the highest score. 
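
The effect of reversing the ranking protocol can be illustrated with a short Python sketch. The helper below is an assumption introduced for illustration (a compact average-rank routine with a hypothetical `lowest_first` switch); the scores are those of Example 12.1.

```python
def tie_adjusted_ranks(scores, lowest_first=True):
    """Average-rank ranking; rank 1 goes to the lowest score unless lowest_first is False."""
    ordered = sorted(scores, reverse=not lowest_first)
    # A block of tied scores occupies positions index+1 .. index+count in the
    # ordered list, so the average rank of the block is index + (count + 1) / 2.
    return [ordered.index(x) + (ordered.count(x) + 1) / 2 for x in scores]


group_1 = [11, 1, 0, 2, 0]
group_2 = [11, 11, 5, 8, 4]
scores = group_1 + group_2
n_total = len(scores)

original_ranks = tie_adjusted_ranks(scores)
reversed_ranks = tie_adjusted_ranks(scores, lowest_first=False)

# Under the reverse protocol each rank equals N + 1 minus the original rank.
mirrored_ranks = [n_total + 1 - rank for rank in original_ranks]

original_sums = (sum(original_ranks[:5]), sum(original_ranks[5:]))
reversed_sums = (sum(reversed_ranks[:5]), sum(reversed_ranks[5:]))

print(original_sums)  # sums of ranks for Groups 1 and 2, original protocol
print(reversed_sums)  # sums of ranks for Groups 1 and 2, reverse protocol
```

Since every reversed rank equals N + 1 minus the original rank, substituting the reversed sums of ranks into Equations 12.1 and 12.2 (presented later in this section) simply exchanges the values of $U_1$ and $U_2$, which is why the smaller of the two values is unchanged. With equal group sizes, as in Example 12.1, the two sums of ranks are themselves exchanged.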

Once all of the subjects have been assigned a rank, the sum of the ranks for each of the
groups is computed. These values, $\Sigma R_1 = 19$ and $\Sigma R_2 = 36$, are computed in Table 12.1.
Upon determining the sum of the ranks for both groups, the values $U_1$ and $U_2$ are computed
employing Equations 12.1 and 12.2.


$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - \Sigma R_1$$ (Equation 12.1)

$$U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - \Sigma R_2$$ (Equation 12.2)


Employing Equations 12.1 and 12.2, the values $U_1 = 21$ and $U_2 = 4$ are computed.


$$U_1 = (5)(5) + \frac{5(5 + 1)}{2} - 19 = 21$$

$$U_2 = (5)(5) + \frac{5(5 + 1)}{2} - 36 = 4$$


Note that $U_1$ and $U_2$ can never be negative values. If a negative value is obtained for either,
it indicates an error has been made in the rankings and/or calculations.

Equation 12.3 can be employed to confirm that the correct values have been computed for $U_1$
and $U_2$.

$$n_1 n_2 = U_1 + U_2$$ (Equation 12.3)


If the relationship in Equation 12.3 is not confirmed, it indicates that an error has been made
in ranking the scores or in the computation of the U values. The relationship described by
Equation 12.3 is confirmed below for Example 12.1.

$$(5)(5) = 21 + 4 = 25$$
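
The computations in Equations 12.1 through 12.3 can be sketched in Python as follows. The function name `u_statistics` and the decision to raise an error when a check fails are assumptions made for this illustration; the input values are the sums of ranks and sample sizes of Example 12.1.

```python
import math


def u_statistics(sum_ranks_1, sum_ranks_2, n1, n2):
    """Compute U1 and U2 and verify them with the checks described in the text."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - sum_ranks_1  # Equation 12.1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - sum_ranks_2  # Equation 12.2
    if u1 < 0 or u2 < 0:
        raise ValueError("A U value is negative: check the rankings and calculations.")
    if not math.isclose(u1 + u2, n1 * n2):          # Equation 12.3
        raise ValueError("U1 + U2 does not equal n1 * n2: check the rankings.")
    return u1, u2


u1, u2 = u_statistics(sum_ranks_1=19, sum_ranks_2=36, n1=5, n2=5)
u = min(u1, u2)  # the smaller value is designated as the obtained U statistic

print(u1, u2, u)  # 21.0 4.0 4.0
```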

V. Interpretation of the Test Results


The smaller of the two values $U_1$ versus $U_2$ is designated as the obtained U statistic. Since
$U_2 = 4$ is smaller than $U_1 = 21$, the value of U = 4. The value of U is evaluated with Table
A11 (Table of Critical Values for the Mann-Whitney U Statistic) in the Appendix. In Table
A11, the critical U values are listed in reference to the number of subjects in each group.² For
$n_1 = 5$ and $n_2 = 5$, the tabled critical two-tailed .05 and .01 values are $U_{.05} = 2$ and $U_{.01} = 0$,
and the tabled critical one-tailed .05 and .01 values are $U_{.05} = 4$ and $U_{.01} = 1$. In order to be
significant, the obtained value of U must be equal to or less than the tabled critical value at the
prespecified level of significance.

Since the obtained value U = 4 must be equal to or less than the aforementioned tabled
critical values, the null hypothesis can only be rejected if the directional alternative hypothesis
$H_1: \theta_1 < \theta_2$ is employed. The directional alternative hypothesis $H_1: \theta_1 < \theta_2$ is supported
at the .05 level, since U = 4 is equal to the tabled critical one-tailed value $U_{.05} = 4$. The data
are consistent with the directional alternative hypothesis $H_1: \theta_1 < \theta_2$, since the average of the
ranks in Group 1 is less than the average of the ranks in Group 2 (i.e., $\bar{R}_1 < \bar{R}_2$).³ The
directional alternative hypothesis $H_1: \theta_1 < \theta_2$ is not supported at the .01 level, since the
obtained value U = 4 is greater than the tabled critical one-tailed value $U_{.01} = 1$.

The nondirectional alternative hypothesis $H_1: \theta_1 \neq \theta_2$ is not supported, since the obtained
value U = 4 is greater than the tabled critical two-tailed value $U_{.05} = 2$.

Since the data are not consistent with the directional alternative hypothesis $H_1: \theta_1 > \theta_2$,
the latter alternative hypothesis is not supported. In order for the directional alternative
hypothesis $H_1: \theta_1 > \theta_2$ to be supported, the average of the ranks in Group 1 must be greater
than the average of the ranks in Group 2 (i.e., $\bar{R}_1 > \bar{R}_2$), and the computed
value of U must be equal to or less than the tabled critical one-tailed value at the prespecified
level of significance.
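
The decision rules applied above can be summarized in a short Python sketch. The critical values in the dictionary are those quoted in this section for two groups of five subjects; the dictionary layout, the function name `null_rejected`, and the `data_consistent` argument are assumptions introduced for illustration.

```python
# Tabled critical values quoted in the text for n1 = n2 = 5, keyed by (tails, alpha).
CRITICAL_U = {
    ("two-tailed", 0.05): 2,
    ("two-tailed", 0.01): 0,
    ("one-tailed", 0.05): 4,
    ("one-tailed", 0.01): 1,
}


def null_rejected(u, tails, alpha, data_consistent=True):
    """U must be <= the critical value; a directional hypothesis must also fit the data."""
    if tails == "one-tailed" and not data_consistent:
        return False
    return u <= CRITICAL_U[(tails, alpha)]


u = 4
mean_rank_1, mean_rank_2 = 19 / 5, 36 / 5

decisions = {
    "theta_1 != theta_2, .05": null_rejected(u, "two-tailed", 0.05),
    "theta_1 < theta_2, .05": null_rejected(u, "one-tailed", 0.05, mean_rank_1 < mean_rank_2),
    "theta_1 < theta_2, .01": null_rejected(u, "one-tailed", 0.01, mean_rank_1 < mean_rank_2),
    "theta_1 > theta_2, .05": null_rejected(u, "one-tailed", 0.05, mean_rank_1 > mean_rank_2),
}

for hypothesis, supported in decisions.items():
    print(hypothesis, supported)
```

Only the directional alternative hypothesis predicting a lower median for Group 1 is supported, and only at the .05 level, which matches the conclusions stated above.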

The results of the Mann-Whitney U test are consistent with those obtained when the t test 
for independent samples is employed to evaluate Example 11.1 (which employs the same set 
of data as Example 12.1). In both instances, the null hypothesis can only be rejected if the re- 
searcher employs a directional alternative hypothesis that predicts a lower degree of depression 
in the group that receives the antidepressant medication (Group 1). 


VI. Additional Analytical Procedures for the Mann-Whitney U 
Test and/or Related Tests 


1. The normal approximation of the Mann-Whitney U statistic for large sample sizes If 
the sample size employed in a study is relatively large, the normal distribution can be employed 
to approximate the Mann-Whitney U statistic. Although sources do not agree on the value of 
the sample size that justifies employing the normal approximation of the Mann-Whitney 
distribution, they generally state that it should be employed for sample sizes larger than those 
documented in the exact table of the U distribution contained within the source. Equation 12.4 
provides the normal approximation of the Mann-Whitney U test statistic.


$$z = \frac{U - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}$$ (Equation 12.4)


In the numerator of Equation 12.4 the term $(n_1 n_2)/2$ is often summarized with the notation
$U_E$, which represents the expected (mean) value of U if the null hypothesis is true. In other
words, if in fact the two groups are equivalent, it is expected that $\bar{R}_1 = \bar{R}_2$. If the latter is true
then $U_1 = U_2$, and both of the values $U_1$ and $U_2$ will equal $(n_1 n_2)/2$. The denominator in
Equation 12.4 represents the expected standard deviation of the sampling distribution for the
normal approximation of the U statistic.

Although Example 12.1 involves only N = 10 scores (a value most sources would view as
too small to use with the normal approximation), it will be employed to illustrate Equation 12.4.
The reader will see that in spite of employing Equation 12.4 with a small sample size, it yields
a result that is consistent with the result obtained when the exact table for the Mann-Whitney
U distribution is employed. It should be noted that since the smaller of the two values $U_1$ versus
$U_2$ is selected to represent U, the value of z will always be negative (unless $U_1 = U_2$, in which
case z = 0). This is the case, since by selecting the smaller value U will always be less than the
expected value $U_E = (n_1 n_2)/2$.

Employing Equation 12.4, the value z = -1.78 is computed.⁴


$$z = \frac{4 - \frac{(5)(5)}{2}}{\sqrt{\frac{(5)(5)(5 + 5 + 1)}{12}}} = -1.78$$
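
The computation above can be reproduced with the following Python sketch of Equation 12.4. The function and variable names are introduced here for illustration only.

```python
import math


def z_normal_approximation(u, n1, n2):
    """Normal approximation of the Mann-Whitney U statistic (Equation 12.4)."""
    expected_u = n1 * n2 / 2                                  # expected value of U
    standard_error = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # denominator of 12.4
    return (u - expected_u) / standard_error


z = z_normal_approximation(u=4, n1=5, n2=5)
z_from_larger_u = z_normal_approximation(u=21, n1=5, n2=5)

print(round(z, 2))                # -1.78
print(round(z_from_larger_u, 2))  # 1.78
```

The second call shows that entering the larger of the two U values changes only the sign of z, not its absolute value.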


The obtained value z = -1.78 is evaluated with Table A1 (Table of the Normal Distribution)
in the Appendix. In order to be significant, the obtained absolute value of z must be
equal to or greater than the tabled critical value at the prespecified level of significance. The
tabled critical two-tailed .05 and .01 values are $z_{.05} = 1.96$ and $z_{.01} = 2.58$, and the tabled
critical one-tailed .05 and .01 values are $z_{.05} = 1.65$ and $z_{.01} = 2.33$. The following guidelines
are employed in evaluating the null hypothesis.

a) If the nondirectional alternative hypothesis $H_1: \theta_1 \neq \theta_2$ is employed, the null hypothesis
can be rejected if the obtained absolute value of z is equal to or greater than the tabled critical
two-tailed value at the prespecified level of significance.

b) If a directional alternative hypothesis is employed, one of the two possible directional 
alternative hypotheses is supported if the obtained absolute value of z is equal to or greater than 
the tabled critical one-tailed value at the prespecified level of significance. The directional 
alternative hypothesis that is supported is the one that is consistent with the data. 

Employing the above guidelines with Example 12.1, the following conclusions are reached. 

Since the obtained absolute value z = 1.78 must be equal to or greater than the tabled critical
value at the prespecified level of significance, the null hypothesis can only be rejected if the
directional alternative hypothesis $H_1: \theta_1 < \theta_2$ is employed. The directional alternative
hypothesis $H_1: \theta_1 < \theta_2$ is supported at the .05 level, since the absolute value z = 1.78 is greater
than the tabled critical one-tailed value $z_{.05} = 1.65$. As noted in Section V, the data are
consistent with the directional alternative hypothesis $H_1: \theta_1 < \theta_2$. The directional alternative
hypothesis $H_1: \theta_1 < \theta_2$ is not supported at the .01 level, since the obtained absolute value
z = 1.78 is less than the tabled critical one-tailed value $z_{.01} = 2.33$.

The nondirectional alternative hypothesis $H_1: \theta_1 \neq \theta_2$ is not supported, since the obtained
absolute value z = 1.78 is less than the tabled critical two-tailed .05 value $z_{.05} = 1.96$.

Since the data are not consistent with the directional alternative hypothesis $H_1: \theta_1 > \theta_2$, the
latter alternative hypothesis is not supported. As noted in Section V, in order for the latter
directional alternative hypothesis to be supported, the following condition must be met: $\bar{R}_1 > \bar{R}_2$.

It turns out that the above conclusions based on the normal approximation are identical to 
those reached when the exact table of the Mann-Whitney U distribution is employed. 

It should be noted that, in actuality, either $U_1$ or $U_2$ can be employed in Equation 12.4 to
represent the value of U. This is the case, since either value yields the same absolute value for
z. Thus, if for Example 12.1 $U_1 = 21$ is employed in Equation 12.4, the value z = 1.78 is
computed. Since the decision with respect to the status of the null hypothesis is a function of the
absolute value of z, the value z = 1.78 leads to the same conclusions that are reached when z = -1.78 
is employed. The decision with regard to a directional alternative hypothesis is not affected, since 
the data are still consistent with the directional alternative hypothesis $H_1: \theta_1 < \theta_2$.


2. The correction for continuity for the normal approximation of the Mann-Whitney U
test⁵ Although not used in most sources, Siegel and Castellan (1988) employ a correction for
continuity for the normal approximation of the Mann-Whitney test statistic. Marascuilo and 
McSweeney (1977) note that the correction for continuity is generally not employed, unless the 
computed absolute value of z is close to the prespecified tabled critical value. The correction for 
continuity, which reduces the absolute value of z, requires that .5 be subtracted from the absolute 
value of the numerator of Equation 12.4 (as well as the absolute value of the numerator of the 
alternative equation described in Endnote 4). The continuity-corrected version of Equation 12.4 
is provided by Equation 12.5.


$$z = \frac{\left| U - \frac{n_1 n_2}{2} \right| - .5}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}$$ (Equation 12.5)


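If the correction for continuity is employed with Example 12.1, the value computed for the
numerator of Equation 12.5 is 8 (in contrast to the value 8.5 computed with Equation 12.4).
Employing Equation 12.5 with Example 12.1, the value z = -1.67 is computed.


$$z = \frac{\left| 4 - \frac{(5)(5)}{2} \right| - .5}{\sqrt{\frac{(5)(5)(5 + 5 + 1)}{12}}} = \frac{8}{4.79} = 1.67$$


Because the numerator of Equation 12.5 employs an absolute value, the equation yields the
absolute value 1.67. The sign of z is negative, since U = 4 is less than the expected value
$(n_1 n_2)/2 = 12.5$.

Since the absolute value z = 1.67 is greater than the tabled critical one-tailed value
$z_{.05} = 1.65$, the directional alternative hypothesis $H_1: \theta_1 < \theta_2$ is still supported at the .05 level.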
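
A Python sketch of the continuity-corrected computation follows. It is an illustration under stated assumptions: the function name is introduced here, the sign of z is restored after the absolute value is taken, and the corrected numerator is not allowed to fall below zero (the last two choices are assumptions of this sketch, not part of Equation 12.5 as stated).

```python
import math


def z_continuity_corrected(u, n1, n2):
    """Equation 12.5, with the sign of (U - expected U) restored afterwards."""
    expected_u = n1 * n2 / 2
    standard_error = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    # Subtract .5 from the absolute value of the numerator. Clamping at 0 is an
    # assumption made here so that the correction never reverses the direction.
    magnitude = max(abs(u - expected_u) - 0.5, 0.0) / standard_error
    return math.copysign(magnitude, u - expected_u)


z_corrected = z_continuity_corrected(u=4, n1=5, n2=5)

print(round(z_corrected, 2))    # -1.67
print(abs(z_corrected) > 1.65)  # still exceeds the one-tailed .05 critical value
```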


3. Tie correction for the normal approximation of the Mann-Whitney U statistic Some 
sources recommend that when an excessive number of ties are present in the data, a tie correction 
should be introduced into Equation 12.4. Equation 12.6 is the tie-corrected equation for the 
normal approximation of the Mann-Whitney U distribution. The latter equation results in a 
slight increase in the absolute value of z.


$$z = \frac{U - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12} - \frac{n_1 n_2 \sum_{i=1}^{s} (t_i^3 - t_i)}{12 (n_1 + n_2)(n_1 + n_2 - 1)}}}$$ (Equation 12.6)


The only difference between Equations 12.4 and 12.6 is the term to the right of the element
$[n_1 n_2 (n_1 + n_2 + 1)]/12$ in the denominator. The result of this subtraction reduces the value of
the denominator, thereby resulting in the slight increase in the absolute value of z. The term
$\sum_{i=1}^{s} (t_i^3 - t_i)$ in the denominator of Equation 12.6 computes a value based on the number of ties
in the data. In Example 12.1 there are s = 2 sets of ties. Specifically, there is a set of ties
involving two subjects with the score 0, and a set of ties involving three subjects with the score
11. The notation $\sum_{i=1}^{s} (t_i^3 - t_i)$ indicates the following: a) For each set of ties, the number of
ties in the set is subtracted from the cube of the number of ties in that set; and b) The sum of all
the values computed in part a) is obtained. Thus, for Example 12.1:


$$\sum_{i=1}^{s} (t_i^3 - t_i) = [(2)^3 - 2] + [(3)^3 - 3] = 30$$


The tie-corrected value z = -1.80 is now computed employing Equation 12.6.


$$z = \frac{4 - \frac{(5)(5)}{2}}{\sqrt{\frac{(5)(5)(5 + 5 + 1)}{12} - \frac{(5)(5)(30)}{12(5 + 5)(5 + 5 - 1)}}} = -1.80$$


The difference between z = -1.80 and the uncorrected value z = -1.78 is trivial, and
consequently the decision the researcher makes with respect to the null hypothesis is not affected,
regardless of which alternative hypothesis is employed.⁶
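
The tie correction can be sketched in Python as follows. The function names are introduced here for illustration; counting the frequency of every score with `collections.Counter` is a convenience of this sketch, and it is harmless because a score that is not tied contributes 1 - 1 = 0 to the sum.

```python
import math
from collections import Counter


def tie_sum(scores):
    """Sum of (t cubed - t) over all sets of tied scores."""
    return sum(t ** 3 - t for t in Counter(scores).values())


def z_tie_corrected(u, n1, n2, ties):
    """Tie-corrected normal approximation of the U statistic (Equation 12.6)."""
    n = n1 + n2
    variance = n1 * n2 * (n + 1) / 12 - n1 * n2 * ties / (12 * n * (n - 1))
    return (u - n1 * n2 / 2) / math.sqrt(variance)


group_1 = [11, 1, 0, 2, 0]
group_2 = [11, 11, 5, 8, 4]

ties = tie_sum(group_1 + group_2)
z_ties = z_tie_corrected(u=4, n1=len(group_1), n2=len(group_2), ties=ties)

print(ties)              # 30
print(round(z_ties, 2))  # -1.8
```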


4. Sources for computing a confidence interval for the Mann-Whitney U test Various 
books that specialize in nonparametric statistics (e.g., Daniel (1990) and Marascuilo and 
McSweeney (1977)) describe the computational procedure for computing a confidence interval 
that can be used in conjunction with the Mann-Whitney U test. The confidence interval
identifies a range of values within which the true difference between the two population medians
is likely to fall.


VII. Additional Discussion of the Mann-Whitney U Test 


1. Power-efficiency of the Mann-Whitney U test When the underlying population distri- 
butions are normal, the asymptotic relative efficiency (which is discussed in Section VII of the 
Wilcoxon signed-ranks test (Test 6)) of the Mann-Whitney U test is .955 (when contrasted 
with the t test for two independent samples). For population distributions that are not normal, 
the asymptotic relative efficiency of the Mann-Whitney U test is generally equal to or greater 
than 1. As a general rule, proponents of nonparametric tests take the position that when a
researcher has reason to believe that the normality assumption of the t test for two independent
samples has been saliently violated, the Mann-Whitney U test provides a powerful test of the 
comparable alternative hypothesis. 


2. Equivalency of the normal approximation of the Mann-Whitney U test and the t test for
two independent samples with rank-orders Conover (1980, 1999), Conover and Iman (1981),
and Zimmerman and Zumbo (1993) note that the large sample normal approximation of the
Mann-Whitney U test (i.e., Equation 12.4) yields a result that is equivalent (with respect to the
exact alpha level computed for the data) to that which will be obtained if the t test for two
independent samples is conducted on the same set of rank-orders. Even if the normal
approximation is not employed, the results obtained from the Mann-Whitney U test through
use of Equations 12.1 and 12.2 will be extremely close (in terms of the alpha value) to those that
will be obtained if the t test for two independent samples is conducted on the same set of rank-
orders.


3. Alternative nonparametric rank-order procedures for evaluating a design involving two 
independent samples In addition to the Mann-Whitney U test a number of other nonpara- 
metric procedures for two independent samples have been developed that can be employed with 
ordinal data. Among the more commonly cited alternative procedures are the following: a) The 
Kolmogorov-Smirnov test for two independent samples (Test 13) (Kolmogorov (1933) and 
Smirnov (1939)), which is described in the next chapter; b) The van der Waerden normal-
scores test for k independent samples (Test 23) (Van der Waerden (1952/1953)), which is
described later in the book, as well as alternative normal-scores tests developed by Terry and 
Hoeffding (Terry (1952)) and Bell and Doksum (1965); c) Tukey's quick test (Tukey (1959)); 
d) The median test for independent samples (Test 16e), which involves dichotomizing two 
samples with respect to their median values, and evaluating the data with the chi-square test for 
r × c tables (Test 16); e) The Wald-Wolfowitz runs test (Wald and Wolfowitz (1940)) (briefly
discussed under the single-sample runs test (Test 10) within the framework of Example 10.6); 
and f) Wilks’ empty-cell test for identical populations (Wilks (1961)). In addition to various 
books which specialize in nonparametric statistics, Sheskin (1984) describes these tests in greater 
detail. 


VIII. Additional Examples Illustrating the Use of the Mann- 
Whitney U Test 


The Mann-Whitney U test can be employed with any of the additional examples noted for the 
t test for two independent samples. Since Examples 11.4 and 11.5 use the same data as that 
employed in Example 12.1, they will yield the identical result. Examples 11.2 and 11.3 can also 
be evaluated with the Mann-Whitney U test, but employ different data than Example 12.1. The 
interval/ratio scores in all of the aforementioned examples have to be rank-ordered in order to 
employ the Mann-Whitney U test. 

Example 12.2 provides one additional example that can be evaluated with the Mann- 
Whitney U test. It differs from Example 12.1 in the following respects: a) The original scores 
are in a rank-order format. Thus, there is no need to transform the scores into ranks from an 
interval/ratio format (as is the case in Example 12.1). It should be noted though, that it is implied 
in Example 12.2 that the ranks are based on an underlying interval/ratio scale; and b) The sample 
sizes are unequal, with $n_1 = 6$ and $n_2 = 7$.


Example 12.2 Doctor Radical, a math instructor at Logarithm University, has two classes in 
advanced calculus. There are six students in Class 1 and seven students in Class 2. The in- 
structor uses a programmed textbook in Class 1 and a conventional textbook in Class 2. At the 
end of the semester, in order to determine if the type of text employed influences student 
performance, Dr. Radical has another math instructor, Dr. Root, rank the 13 students in the two 
classes with respect to math ability. The rankings of the students in the two classes follow: 
Class 1: 1, 3, 5, 7, 11, 13; Class 2: 2, 4, 6, 8, 9, 10, 12 (assume the lower the rank the better the
student). 


Employing the Mann-Whitney U test with Example 12.2 the following values are com- 
puted. 


$$\Sigma R_1 = 40 \qquad \Sigma R_2 = 51$$

$$U_1 = (6)(7) + \frac{6(6 + 1)}{2} - 40 = 23$$

$$U_2 = (6)(7) + \frac{7(7 + 1)}{2} - 51 = 19$$

$$U_1 + U_2 = 23 + 19 = 42 = n_1 n_2 = (6)(7)$$

$$z = \frac{19 - \dfrac{(6)(7)}{2}}{\sqrt{\dfrac{(6)(7)(6 + 7 + 1)}{12}}} = -.29$$


Since the value $U_2 = 19$ is less than $U_1 = 23$, U = 19. Employing Table A11, for
$n_1 = 6$ and $n_2 = 7$, the tabled critical two-tailed values are $U_{.05} = 6$ and $U_{.01} = 3$, and the
tabled critical one-tailed values are $U_{.05} = 8$ and $U_{.01} = 4$. Since in order to be significant the
obtained value U = 19 must be equal to or less than the tabled critical value, the null hypothesis
$H_0: \theta_1 = \theta_2$ cannot be rejected regardless of which alternative hypothesis is employed. The use
of Equation 12.4 for the normal approximation confirms this result, since the absolute value
z = .29 is less than the tabled critical two-tailed values $z_{.05} = 1.96$ and $z_{.01} = 2.58$, and the
tabled critical one-tailed values $z_{.05} = 1.65$ and $z_{.01} = 2.33$.
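
The computations for Example 12.2 can be reproduced in a few lines. This is a minimal sketch, assuming the ranks are already available as lists; the function name mann_whitney_u and the variable names are illustrative and do not come from the text.

```python
import math


def mann_whitney_u(ranks_1, ranks_2):
    """Return (U1, U2) from the ranks of two groups (Equations 12.1 and 12.2)."""
    n1, n2 = len(ranks_1), len(ranks_2)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - sum(ranks_1)
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - sum(ranks_2)
    return u1, u2


class_1 = [1, 3, 5, 7, 11, 13]
class_2 = [2, 4, 6, 8, 9, 10, 12]
n1, n2 = len(class_1), len(class_2)

u1, u2 = mann_whitney_u(class_1, class_2)
u = min(u1, u2)  # the smaller value is designated as U

# Normal approximation without a tie correction (Equation 12.4).
z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(u1, u2, round(z, 2))
```

The check that the two U values sum to the product of the sample sizes is a convenient way to catch arithmetic errors before consulting the table of critical values.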


IX. Addendum


Computer-intensive tests During the past 20 years the availability of computers has allowed 
for the use of hypothesis testing procedures which involve such an excessive amount of com- 
putation that in most instances it would be impractical to conduct such tests by hand or even with 
a conventional calculator. The first statisticians who discussed such tests were Fisher (1935) and
Pitman (1937a, 1937b, 1938), who described procedures known as randomization or permu- 
tation tests. Aside from their computer friendliness, another reason for the increased popularity 
of the computer-intensive procedures (also referred to as data-driven procedures (Sprent, 
1998)) to be discussed in this section is that they have associated with them few if any of the 
distributional assumptions that underlie parametric tests, as well as certain nonparametric tests. 
As it is employed in this book, the term computer-intensive test/procedure is used to describe 
any of a variety of procedures that are computer dependent.

The computer dependency of such tests reflects the fact that in addition to carrying out 
numerous computations, many of these tests employ the computer as a mechanism for repeated 
resampling of data. Julian Simon (1969) was among those who first discussed the advantages 
of employing computer based resampling in the analysis of data. Resampling is a process in 
which subsets of scores are selected from an original set of data. In the final analysis, the 
distinguishing feature between the various computer-intensive procedures that employ 
resampling is the specific protocol employed in selecting subsamples from the original set of 
data. Two resampling procedures that will be described in the Addendum are the bootstrap 
(developed by Efron (Efron (1979) and Efron and Tibshirani (1993))) and the jackknife
(developed by Quenouille (1949) and named by Tukey (1958)). The intent of the discussion
to follow is to present the reader with the basic principles underlying computer-intensive 
procedures. Those who are interested in a more in-depth discussion of this complex topic should 
consult sources such as Manly (1997) and Sprent (1998), or other relevant references that are 
cited in this Addendum. 


Randomization and permutation tests The terms permutation test, randomization test,
rerandomization test, and exact test are employed interchangeably by many sources (Good
(1994)). Within the context of such tests, the word permutation is used to represent a specific
configuration/arrangement of scores. Such tests yield results that are exclusively a function of 
the observed data, and are not based on any assumptions regarding an underlying population 
distribution. The use of the term randomization test reflects the fact that in most instances such 
tests contrast two or more groups to which it is assumed subjects have been randomly assigned. 
Although the data such tests are employed to evaluate need not be a random sample, if random 
selection cannot be assumed, it will limit the generalizability of the test’s results. Randomization 
tests evaluate the data obtained in an experiment within the framework of the distribution of all 
possible random arrangements that can be obtained for that set of data. In a randomization test, 
instead of evaluating the outcome of an experiment in reference to some underlying theoretical 
population distribution (e.g., the normal, t, F distributions, etc.), the data themselves are employed to
construct the relevant sampling distribution. By constructing a sampling distribution based on 
the data, the researcher is not restricted by any assumptions which might be associated with an 
underlying theoretical distribution. The fact that it is a requirement of a randomization test to 
compute a separate sampling distribution is the reason why, in almost all instances, a computer 
is needed to conduct such a test. In this section the principles underlying randomization tests will 
be demonstrated in reference to evaluating an experiment involving two independent samples. 
In the example presented, a randomization test will be employed to do the following: a) Evaluate 
the interval/ratio scores of two independent groups; and b) Evaluate the rank-orderings of the 
same set of interval/ratio scores. In the case of the latter analysis, it will be demonstrated that 
it is equivalent to the analysis conducted for the Mann-Whitney U test with the same set of 
ranks. This is the case since the Mann-Whitney U test (as well as many other rank-order tests) 
represents an example of a randomization/permutation test that is based upon permutations of 
ranks (i.e., configurations/arrangements of ranks). 


Test 12a: The randomization test for two independent samples Among the other names
that are employed for the test to be described in this section are Fisher's randomization test
for two independent samples and the Fisher-Pitman test (since the procedure was first
described by Fisher (1935) and Pitman (1937a, 1937b, 1938)). If a researcher has serious doubts
regarding the normality and/or homogeneity of variance assumptions underlying the t test for
two independent samples, the randomization test for two independent samples provides
one with a viable alternative for evaluating the data. The data will be represented by the interval/
ratio scores of two independent samples with $n_1$ scores in Group 1 and $n_2$ scores in Group 2,
with $n_1 + n_2 = N$. Example 12.3 will be employed to illustrate the use of the randomization
test for two independent samples.


Example 12.3 Each of six subjects is randomly assigned to one of two groups, with $n_1 = 3$
subjects in Group 1 and $n_2 = 3$ subjects in Group 2. While attempting to solve a mechanical
problem, the subjects in Group 1 are exposed to a high concentration of atmospheric ozone. The
subjects in Group 2 are required to solve the same mechanical problem under normal
atmospheric conditions. The number of minutes it takes each of the subjects in the two groups to solve
the problem follows: Group 1: 15, 18, 21; Group 2: 7, 10, 11. Do the data indicate that there
are differences between the two groups?


The null and alternative hypotheses evaluated with the randomization test for two 
independent samples are stated below. Note that, as stated, the null and alternative hypotheses 
do not include any reference to a population parameter. If a researcher included a population
parameter in $H_0$ and $H_1$, there would be certain assumptions about the underlying population
that would have to be met. 


Null hypothesis $H_0$: There is no difference in the performance of the two groups. If
the null hypothesis is supported, one can conclude that the two groups represent the same 
population. 


Alternative hypothesis $H_1$: The performance of the two groups is not equivalent. If the
alternative hypothesis is supported, one can conclude that the two groups do not represent the 
same population. 

The alternative hypothesis stated above is nondirectional, and is evaluated with a two- 
tailed test. Either of the directional alternative hypotheses noted below can also be employed.
If a directional alternative hypothesis is employed, a one-tailed test is conducted. 


Alternative hypothesis $H_1$: The scores of subjects in Group 1 are higher than the scores of
subjects in Group 2. 


Alternative hypothesis $H_1$: The scores of subjects in Group 1 are lower than the scores of
subjects in Group 2. 


The general question that is addressed by the randomization test for two independent 
samples is as follows: If the scores of all N subjects are collapsed into a single group, how 
likely is it that the specific configuration of scores obtained in the experiment will be obtained 
if we randomly select $n_1$ scores and assign them to Group 1 and assign the remaining
$N - n_1 = n_2$ scores to Group 2? Common sense suggests that if chance is operating, we will
expect an equivalent distribution of scores in the two groups. On the other hand, if there are 
differences between the groups, it is expected that the distributions will not be equivalent. Thus, 
if one group has a preponderance of high scores and the other group has a preponderance of low 
scores, we will want to determine the exact likelihood of obtaining such an outcome purely as 
a result of chance. By constructing a sampling distribution from the data, the randomization 
test for two independent samples allows us to do this. 

In order to evaluate the data for Example 12.3, we must first answer the following question: 
How many ways can six subjects be assigned to two groups with three subjects per group?
This is equivalent to asking, what is the number of combinations of six things taken three at a
time? (The reader may find it useful to review the discussion of combinations in Section IV of
the binomial sign test for a single sample (Test 9).) Employing Equation 9.4, we determine
that the number of combinations of six things taken three at a time is equal to 20:
$\binom{6}{3} = \frac{6!}{3!\,3!} = 20$. This result tells us that there are 20 possible ways that six subjects can be assigned to two
groups with three subjects per group.
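
The count of 20 arrangements can be confirmed, and the arrangements themselves listed, with the Python standard library. This is only an illustrative sketch; the variable names are not from the text, and math.comb requires Python 3.8 or later.

```python
import math
from itertools import combinations

scores = [7, 10, 11, 15, 18, 21]  # the six scores of Example 12.3, pooled

# Number of combinations of six things taken three at a time.
n_arrangements = math.comb(len(scores), 3)

# Every possible set of three scores that could be assigned to Group 1;
# the remaining three scores automatically make up Group 2.
group_1_options = list(combinations(scores, 3))

print(n_arrangements, len(group_1_options))
```

Each tuple in group_1_options corresponds to one way of filling Group 1 with three of the six scores.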

Table 12.3 summarizes the 20 possible ways in which the six scores 7, 10, 11, 15, 18, and 
21 can be distributed between two groups with $n_1 = 3$ and $n_2 = 3$. The left side of Table 12.3
(Columns 2 and 3) lists the 20 possible ways three scores can be randomly assigned to Group 1 
(or Group 2), as well as the sum of the three scores for each arrangement. The last two columns 
of the table are based on the assumption that the six scores have been converted into ranks, and 
list the rank-orders for the three interval/ratio scores in a given row (Column 4), and the sum
of the rank-orders for that row (Column 5). Within the framework of the 20 arrangements that 
are listed, Column 4 contains each of the possible ways three rank-orderings within a set of six 
rank-orderings can be randomly assigned to Group 1 (or Group 2). 

Table 12.3 Possible Arrangements for Example 12.3 Data

                Scores in     Sum of Scores   Ranks in    Sum of Ranks
Arrangement     Group 1       in Group 1      Group 1     in Group 1
     1          7, 10, 11         28          1, 2, 3          6
     2          7, 10, 15         32          1, 2, 4          7
     3          7, 11, 15         33          1, 3, 4          8
     4          7, 10, 18         35          1, 2, 5          8
     5          7, 11, 18         36          1, 3, 5          9
     6          10, 11, 15        36          2, 3, 4          9
     7          7, 10, 21         38          1, 2, 6          9
     8          7, 11, 21         39          1, 3, 6         10
     9          10, 11, 18        39          2, 3, 5         10
    10          7, 15, 18         40          1, 4, 5         10
    11          10, 11, 21        42          2, 3, 6         11
    12          7, 15, 21         43          1, 4, 6         11
    13          10, 15, 18        43          2, 4, 5         11
    14          11, 15, 18        44          3, 4, 5         12
    15          10, 15, 21        46          2, 4, 6         12
    16          7, 18, 21         46          1, 5, 6         12
    17          11, 15, 21        47          3, 4, 6         13
    18          10, 18, 21        49          2, 5, 6         13
    19          11, 18, 21        50          3, 5, 6         14
    20          15, 18, 21        54          4, 5, 6         15


Table 12.4 Sampling Distribution for Sums of Scores in Example 12.3

Sum of                                  Cumulative
Scores     Frequency     Probability    Probability
  28           1             .05            .05
  32           1             .05            .10
  33           1             .05            .15
  35           1             .05            .20
  36           2             .10            .30
  38           1             .05            .35
  39           2             .10            .45
  40           1             .05            .50
  42           1             .05            .55
  43           2             .10            .65
  44           1             .05            .70
  46           2             .10            .80
  47           1             .05            .85
  49           1             .05            .90
  50           1             .05            .95
  54           1             .05           1.00
Sums          20            1.00


The values that are computed for the sums of the three interval/ratio scores for each of the 
20 arrangements will represent the points on the abscissa (X-axis) of the sampling distribution we 
will employ to evaluate the data. Since some of the 20 arrangements yield the same sum, there 
are a total of 16 possible values the sum of three scores can assume. The frequency or likelihood 
of occurrence for each of the 16 sums will be represented on the ordinate (Y-axis). Table 12.4 
summarizes the probabilities for the 16 sums of scores. Note that in Table 12.4 the sums of the 
scores are arranged in increasing order of magnitude, and that the sum of the column labelled 
Frequency is equal to 20, which is the total number of arrangements for the interval/ratio scores 
for our data. The probability distribution in Columns 3 and 4 of Table 12.4 represents the
sampling distribution that is employed within the framework of the randomization test for two 
independent samples to evaluate the null hypothesis. Column 3 lists the probability for the sum 
of scores in each row, whereas Column 4 lists the cumulative probability for the sum of scores 
in that row (cumulative probabilities are discussed in the Kolmogorov-Smirnov goodness-of-fit 
test for a single sample (Test 7)). 

Table 12.5 summarizes the sampling distribution for the sums of ranks. It turns out that for 
our data there are only 10 possible values the sum of three ranks can equal. The frequency or 
likelihood of occurrence for each of the 10 sums of ranks is summarized in Table 12.5. Note that 
in Table 12.5 the sums of the ranks are arranged in increasing order of magnitude, and that the 
sum of Column 2 (labelled Frequency) is equal to 20, which is the total number of arrangements 
of ranks for our data. The probability distribution in Table 12.5 (Column 3 lists the probability 
for the sum of ranks in each row, whereas Column 4 lists the cumulative probability for the sum 
of ranks in that row) represents the sampling distribution that can be employed within the frame- 
work of the randomization test for two independent samples to evaluate the null hypothesis. 
It should be emphasized, however, that the original test developed by Fisher (1935) and Pitman 
(1937a, 1937b, 1938) evaluated the sums of interval/ratio scores and not summed values that
were based on ranks. Thus, when summed values based on the ranks of two independent samples 
are evaluated using the Fisher-Pitman randomization procedure, the test is not generally 
referred to as the randomization test for two independent samples. Instead it is referred to as 
the Mann-Whitney U test. 


Table 12.5 Sampling Distribution for Sums of Ranks in Example 12.3

Sum of                                  Cumulative
Ranks      Frequency     Probability    Probability
   6           1             .05            .05
   7           1             .05            .10
   8           2             .10            .20
   9           3             .15            .35
  10           3             .15            .50
  11           3             .15            .65
  12           3             .15            .80
  13           2             .10            .90
  14           1             .05            .95
  15           1             .05           1.00
Sums          20            1.00
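
The two sampling distributions can be generated directly from the pooled scores. The sketch below is illustrative rather than definitive: the helper cumulative_table and the variable names are assumptions, and the simple ranking used here is only valid because the six scores contain no ties.

```python
from collections import Counter
from itertools import combinations

scores = [7, 10, 11, 15, 18, 21]
# Rank 1 goes to the lowest score (valid only because no scores are tied).
ranks = {score: rank for rank, score in enumerate(sorted(scores), start=1)}

arrangements = list(combinations(scores, 3))
score_sums = Counter(sum(group) for group in arrangements)
rank_sums = Counter(sum(ranks[s] for s in group) for group in arrangements)


def cumulative_table(counts):
    """Rows of (sum, frequency, probability, cumulative probability)."""
    total = sum(counts.values())
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value], counts[value] / total, running / total))
    return rows


for row in cumulative_table(rank_sums):
    print(row)
```

Calling cumulative_table(score_sums) yields the 16 rows of the distribution for sums of scores, and cumulative_table(rank_sums) yields the 10 rows of the distribution for sums of ranks.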


In order to reject the null hypothesis, the sum of scores/ranks (and consequently the arrange- 
ment which yields a specific sum) will have to be one that is highly unlikely to occur as a result 
of chance. In the case of a two-tailed alternative hypothesis this will translate into an arrange- 
ment with a sum of scores/ranks that is either very high or very low. In the case of a one-tailed 
alternative hypothesis, the following will apply: a) In order for the one-tailed alternative 
hypothesis predicting that the scores of the subjects in Group 1 are higher than the scores of the 
subjects in Group 2 to be supported, it will require an arrangement where the sum of scores/ranks 
for Group 1 is very high; b) In order for the one-tailed alternative hypothesis predicting that the 
scores of the subjects in Group 1 are lower than the scores of the subjects in Group 2 to be 
supported, it will require an arrangement where the sum of scores/ranks for Group 1 is very low. 

In point of fact, the sum of scores for the observed data in Group 1 is 54 (based on the three 
scores 15, 18, and 21), which is the largest possible sum in Tables 12.3 and 12.4. In the same 
respect, the sum of the ranks for the observed ranks in Group 1 is 15 (based on the ranks 4, 5, and 
6 for the three scores 15, 18, and 21), which is the largest possible sum of ranks in Tables 12.3
and 12.5. In both instances the likelihood of obtaining a sum of that magnitude is .05 (or 5%).
The latter value delineates the upper 5% of the sampling distribution we have derived for the
data. This result allows us to reject the null hypothesis at the .05 level, but only for the one- 
tailed alternative hypothesis predicting that the scores of the subjects in Group 1 are higher 
than the scores of the subjects in Group 2. The latter alternative hypothesis is supported since 
there is only a 5% likelihood that a sum of scores or ranks equal to or greater than the one 
observed could have occurred as a result of chance. 

The null hypothesis cannot be rejected if the one-tailed alternative hypothesis predicting 
that the scores of the subjects in Group 1 are lower than the scores of the subjects in Group 2 is 
employed. This is the case, since the data are inconsistent with the latter alternative hypothesis. 

The null hypothesis cannot be rejected at the .05 level if the two-tailed alternative 
hypothesis is employed. The reason for this is that even though the observed data result in the
highest possible sum of scores/ranks, the total number of possible arrangements is 20. As a result 
of the latter, the lowest probability possible for the most extreme arrangement in either direction 
(the highest or lowest sum) is 1/20 = .05. In order for a two-tailed alternative hypothesis to be 
supported, it will require that the observed data fall within the extreme 5% of the cases involving 
both tails of the distribution. One-half of that 5% will have to come from the upper tail of the
sampling distribution, and the other half from the lower tail. Thus, in order for the two-tailed 
alternative hypothesis to be supported, it will require that the probability associated with the 
highest sum (as well as the lowest sum) be equal to or less than .05/2 = .025. In our example, 
the small sample size (which results in 20 arrangements) does not allow us to evaluate a two-
tailed alternative hypothesis at the .05 level, since it is not possible for the highest sum (or lowest
sum) to have a probability as low as .025. In order to evaluate the two-tailed alternative 
hypothesis at that level, there would have to be a minimum of 40 arrangements (since 1/40 = 
.025). It should be evident that because of our small sample size, it is not possible to evaluate 
the one-tailed or two-tailed alternative hypotheses at the .01 level. 

It was noted earlier in the discussion of randomization tests that the Mann-Whitney U test 
represents an example of a randomization test that is based upon permutations of ranks (i.e., 
configurations/arrangements of ranks). What this translates into is that, if the data for Example 
12.3 are evaluated with the Mann-Whitney U test, it should yield a result identical to that ob- 
tained with the randomization test for two independent samples, when the latter procedure is 
applied to ranks. This will now be demonstrated by evaluating the data for Example 12.3 with 
the Mann-Whitney U test. 

From Table 12.3 we know that the sum of the ranks for Group 1 is $\Sigma R_1 = 15$ (which is the
sum of the ranks 4, 5, and 6 for the three scores 15, 18, and 21). Since the three scores for Group
2 are the three lowest scores, the sum of the ranks for Group 2 will equal $\Sigma R_2 = 6$ (which is the
sum of the ranks 1, 2, and 3 for the three scores 7, 10, and 11). Employing Equations 12.1 and
12.2, we obtain $U_1 = 0$ and $U_2 = 9$.


$$U_1 = (3)(3) + \frac{3(3 + 1)}{2} - 15 = 0$$

$$U_2 = (3)(3) + \frac{3(3 + 1)}{2} - 6 = 9$$


Since the smaller of the two values $U_1$ versus $U_2$ is designated as the U statistic, U = 0.
Note that in Table A11 no tabled critical two-tailed .05 U value is listed for $n_1 = 3$ and $n_2 = 3$.
The tabled critical one-tailed .05 value listed for $n_1 = 3$ and $n_2 = 3$ is $U_{.05} = 0$. No tabled
critical .01 U values are listed due to the small sample size. Since the obtained value U = 0 is
equal to the tabled critical one-tailed .05 value $U_{.05} = 0$, the one-tailed alternative hypothesis
predicting that the scores of the subjects in Group 1 are higher than the scores of the subjects in
Group 2 is supported at the .05 level. The one-tailed alternative hypothesis predicting that the
scores of the subjects in Group 1 are lower than the scores of the subjects in Group 2 is not
supported, since it is inconsistent with the observed data. Because no critical two-tailed values
are listed in Table A11 for $n_1 = 3$ and $n_2 = 3$, the two-tailed alternative hypothesis cannot be
evaluated at the .05 level. The results of the Mann-Whitney U test are thus identical to those
obtained when the data for Example 12.3 are evaluated with the randomization test for two 
independent samples (in which case the one-tailed alternative hypothesis that is consistent with 
the data is supported at the .05 level). 
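
The Mann-Whitney computations for Example 12.3 can be carried out starting from the raw scores. The following sketch assumes there are no tied scores (which is true for these data), so a simple sort is enough to assign ranks; the variable names are illustrative.

```python
group_1 = [15, 18, 21]
group_2 = [7, 10, 11]

# Rank the pooled scores from lowest (rank 1) to highest; no ties are handled.
pooled = sorted(group_1 + group_2)
rank_of = {score: position for position, score in enumerate(pooled, start=1)}

sum_r1 = sum(rank_of[x] for x in group_1)
sum_r2 = sum(rank_of[x] for x in group_2)

n1, n2 = len(group_1), len(group_2)
# Equations 12.1 and 12.2; n(n + 1) is always even, so integer division is exact.
u1 = n1 * n2 + n1 * (n1 + 1) // 2 - sum_r1
u2 = n1 * n2 + n2 * (n2 + 1) // 2 - sum_r2
u = min(u1, u2)

print(sum_r1, sum_r2, u1, u2, u)
```

A value of U = 0 arises only when every score in one group exceeds every score in the other, which corresponds to the single most extreme arrangement among the 20 possible arrangements of ranks.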

When the size of the samples for which a randomization test is employed becomes large, the 
number of possible combinations that can be computed for the data may become excessive to the 
point that it even becomes impractical for a computer to construct an exact sampling distribution 
based on every possible arrangement. In such a situation an approximate randomization test 
may be used. In the latter test, the computer constructs a sampling distribution based on randomly 
selecting a large number (but not all) of the possible arrangements for the data. The resulting 
sampling distribution is employed to evaluate the sample data. 
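
The idea of an approximate randomization test can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name, the choice of 10,000 random arrangements, the fixed seed, and the use of the Group 1 sum as the test statistic are all assumptions made for the example. The small data set of Example 12.3 is used only so that the estimate can be compared with the exact value of .05.

```python
import random


def approximate_randomization_p(group_1, group_2, n_resamples=10_000, seed=0):
    """Estimate the one-tailed probability that the Group 1 sum is at least
    as large as the observed sum, using randomly selected arrangements."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pooled = list(group_1) + list(group_2)
    observed = sum(group_1)
    hits = 0
    for _ in range(n_resamples):
        # Randomly assign len(group_1) of the pooled scores to Group 1.
        resampled_group_1 = rng.sample(pooled, len(group_1))
        if sum(resampled_group_1) >= observed:
            hits += 1
    return hits / n_resamples


p_estimate = approximate_randomization_p([15, 18, 21], [7, 10, 11])
print(p_estimate)
```

The estimate fluctuates around the exact probability, and the fluctuation shrinks as the number of random arrangements is increased.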

Good (1994, p. 114) notes that under certain conditions a randomization test may provide 
a more powerful test of an alternative hypothesis than a parametric procedure. Sprent (1998, p.
52), however, points out that a limitation of the randomization test for two independent samples
is that if the test is employed with interval/ratio data containing one or more outliers, it behaves
very much like the t test for two independent samples (and thus may be unreliable). Conover
(1999, p. 408) notes that when the randomization test for two independent samples is applied 
to interval/ratio data, under certain conditions (e.g., outliers present in the data, skewed distribu- 
tions) it provides a less powerful test of an alternative hypothesis than an analogous parametric 
procedure or a nonparametric rank-order procedure (which as noted in this section may be a 
randomization test on rank-orders). Ludbrook and Dudley (1998) provide a good history of
randomization tests, and discuss the merits of employing such tests in biomedical research. 
Manly (1997) provides a comprehensive discussion of randomization tests. 


Test 12b: The bootstrap Sprent (1998, p. 28) notes that although the philosophy underlying 
the bootstrap is different from that upon which permutation tests are based, it employs a similar 
methodology, and often yields results that are concordant with those which will be obtained with 
a permutation test. The bootstrap is based on the general assumption that a random sample can 
be used to determine the characteristics of the underlying population from which the sample is 
derived. However, instead of using a sample statistic (e.g., the sample standard deviation) to
estimate a population parameter (e.g., the population standard deviation), as is done within the 
framework of conventional parametric statistical tests, the bootstrap uses multiple samples derived 
from the original data to provide what in some instances may be a more accurate measure of the 
population parameter. The most common application of the bootstrap involves estimating a 
population standard error and/or confidence interval. 
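
The resampling idea behind the bootstrap can be sketched in a few lines. Everything in this example is an assumption made for illustration: the data are hypothetical (they are not taken from any example in this book), the statistic is the sample mean, 5,000 resamples are drawn with replacement, and a simple percentile interval is used. The functions statistics.fmean and random.Random.choices require Python 3.8 or later.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=5000, seed=0):
    """Means of resamples drawn with replacement, each the size of the sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


sample = [12, 15, 9, 20, 14, 11, 17, 13]  # hypothetical scores
means = sorted(bootstrap_means(sample))

# Bootstrap estimate of the standard error of the mean.
standard_error = statistics.stdev(means)

# Simple 95% percentile interval: cut off 2.5% of the resampled means in each tail.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]

print(round(standard_error, 2), lower, upper)
```

The same loop can be used with any other statistic (e.g., a median or a trimmed mean) by replacing statistics.fmean, which is the main practical appeal of the procedure.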

The more ambiguous the information available to a researcher regarding an underlying 
population distribution, the more likely it is that the bootstrap may prove useful. Sprent (1998) 
notes that unlike permutation tests, which are able to provide a researcher with exact probability 
values, the use of the bootstrap only leads to approximate results. Nevertheless, in this case 
approximate results may, in the final analysis, be considerably more accurate than the results 
derived from an analysis that is based on an invalid theoretical model. For this reason, proponents 
of the bootstrap justify its use in circumstances where there is reasonable doubt regarding the 
characteristics of the underlying population distribution from which a sample is drawn. The most
frequent justification for using the bootstrap is when there is reason to believe that data may not
be derived from a normally distributed population. Another condition that might merit the use
of the bootstrap is one involving a sample that contains one or more outliers (outliers are
discussed in Section VII of the t test for two independent samples). If, as a result of outliers being
present in the data, a researcher elects to trim scores from the tails of a sample distribution, there 
is no equation available for providing the researcher with an unbiased estimate of certain popu- 
lation parameters (e.g., the standard error). 

Efron and Tibshirani (1993, p. 393) note that the bootstrap implements familiar statistical
calculations (e.g., computing standard errors, confidence intervals, etc.) in an unfamiliar way:
specifically, through use of computer driven methods, as opposed to the use of mathematical
equations. They state that, even though it employs a different methodology, the bootstrap is based
on mathematical theory which ensures its compatibility with traditional theories of statistical
inference. Sprent (1993, p. 291) notes that bootstrapping can be a valuable technique when there 
is no clear analytic theory to obtain a measure of accuracy of an estimator. According to Efron 
and Tibshirani (1993, p. 394), at present the accuracy of the bootstrap is optimal in the estimation 
of values such as standard errors and confidence intervals, and it is weakest in evaluating those 
hypotheses which are typically evaluated with the more conventional inferential statistical tests. 

At this point the methodology for the bootstrap will be demonstrated with a simple example. 
It is assumed that within the framework of actual research, the procedure to be described below 
would be carried out with a computer. Additionally, the sample size employed in a study will 
generally be larger than the value n = 5 employed in the example. Obviously, if a sample is 
randomly drawn from a population, the larger the value of n (i.e., the sample size) the more likely 
it is that the sample will be representative of the population. It is important to note that as a result 
of chance factors, even a random sample can be unrepresentative of an underlying population. 
To the degree that a sample is not representative of a population, the bootstrap will not provide 
an accurate estimate of the parameter under investigation. The general issue of what size a 
sample should be in order to employ the bootstrap is subject to debate. Mooney and Duval 
(1993) note that in most instances a random sample comprised of 30-50 observations will be 
sufficient to provide a good bootstrap estimate. Although Manly (1997) states that the theory 
underlying the bootstrap ensures that it will work well in certain situations involving large 
samples, he notes that in actuality a substantial amount of published research does not employ 
large samples. Under such circumstances Manly (1997) states that the use of the bootstrap 
becomes more problematical. To illustrate the latter, he provides an example involving data 
derived from an exponential distribution which requires a sample size of more than 100 in order 
for the bootstrap to provide an accurate estimate of a confidence interval. 

An example will now be presented to illustrate the application of the bootstrap methodology 
in computing a confidence interval. 


Example 12.4 Assume that five diamonds are randomly selected from a population of diamonds 
manufactured by one of the world’s largest distributors of precious stones. For each of the 
n=5 diamonds, the number of imperfections observed on its surface is recorded. Let us assume 
there is reason to believe that the distribution of imperfections in the underlying population of 
diamonds is not normal. The number of imperfections observed in each of the five diamonds that 
comprise our sample follows: 12, 7, 8, 2, 4. A quality control supervisor in the company 
requests a 99% confidence interval for the population standard deviation for the number of 
imperfections per diamond. Employ the bootstrap to compute a 99% confidence interval for the 
population standard deviation. (For a full discussion of confidence intervals the reader should 
review the appropriate material in Section VI of the single-sample t test (Test 2).) 

The methodology of the bootstrap requires that sampling with replacement be employed 
to select a large number of subsamples from the sample of five scores (Endnote 1 of the binomial 
sign test for a single sample (Test 9) explains the distinction between sampling with replacement 
and sampling without replacement). We will let m represent the number of 
subsamples. Each of the m subsamples will be referred to as a bootstrap sample. As is the case 
with our original sample, the sample size for each of the bootstrap samples will be n = 5. Since 
we will employ sampling with replacement, within any bootstrap sample any of the original five 
scores can occur any number of times ranging from zero to five (since within any bootstrap 
sample, each score selected will always be randomly drawn from the pool of five scores that 
constitute the original sample). Let us assume we obtain m = 1000 bootstrap samples (all of which 
are randomly generated by a computer algorithm that selects random samples through use of 
sampling with replacement). The scores for the first two and last of the 1000 bootstrap samples 
are noted below. 


Bootstrap sample 1: 2, 2, 7, 8, 12 
Bootstrap sample 2: 8, 12, 7, 7, 12 


Bootstrap sample 1000: 2, 4, 12, 8, 12 


Employing Equation I.8, the computer calculates the unbiased estimate of the standard 
deviation for each of the m = 1000 bootstrap samples. The computation of $\tilde{s}$ is demonstrated 
for Bootstrap sample 1 (where $\Sigma X = 31$ and $\Sigma X^2 = 265$). 


$$\tilde{s}_1 = \sqrt{\frac{265 - \left[(31)^2/5\right]}{5 - 1}} = 4.27$$
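
The arithmetic for the three bootstrap samples listed above can be checked with a few lines of code. The following is a minimal sketch in Python; the function name `unbiased_sd` and the dictionary layout are illustrative assumptions, not part of the original procedure. It applies the raw-score form of the computation shown for Bootstrap sample 1 to each listed sample.

```python
import math

def unbiased_sd(scores):
    """Square root of the unbiased variance estimate (denominator n - 1),
    using the raw-score form: [sum(X^2) - (sum(X))^2 / n] / (n - 1)."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x2 = sum(x * x for x in scores)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

# The three bootstrap samples listed in the text, keyed by sample number.
bootstrap_samples = {
    1: [2, 2, 7, 8, 12],
    2: [8, 12, 7, 7, 12],
    1000: [2, 4, 12, 8, 12],
}

sd_values = {k: round(unbiased_sd(v), 2) for k, v in bootstrap_samples.items()}
print(sd_values)  # {1: 4.27, 2: 2.59, 1000: 4.56}
```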


The computed values of $\tilde{s}$ for Bootstrap sample 2 and Bootstrap sample 1000 are 
$\tilde{s}_2 = 2.59$ and $\tilde{s}_{1000} = 4.56$. Once we have computed the standard deviation values for all 
1000 bootstrap samples, the values are arranged ordinally, i.e., from lowest to highest. At this 
point we can employ the 1000 $\tilde{s}$ values to determine the 99% confidence interval. The 99% 
confidence interval stipulates the range of values within which we can be 99% confident the 
population standard deviation falls (or stated probabilistically, there is a .99 probability the 
population standard deviation falls within the confidence interval). One-half of one per cent (.5% 
or .005 expressed as a proportion) of the scores fall to the left of the lower bound of the 99% 
confidence interval, and one-half of one per cent (.5%) of the scores fall to the right of the upper 
bound of the 99% confidence interval. The scores in between the two bounds fall within the 99% 
confidence interval. Thus, in the case of our m = 1000 bootstrap sample standard deviations, the 
middle 99% will fall within the 99% confidence interval. The extreme .5% in the left tail (i.e., 
the .5% lowest $\tilde{s}$ values computed) and the extreme .5% in the right tail (i.e., the .5% highest 
$\tilde{s}$ values computed) will fall outside the 99% confidence interval. Thus, the lower bound of the 
99% confidence interval will be the score at the sixth ordinal position. We obtain the value of 
the latter ordinal position by multiplying m = 1000 (the total number of scores in our sampling 
distribution of standard deviation scores) by .005 (i.e., .5%) and adding 1 (i.e., m(.005) + 1 
= (1000)(.005) + 1 = 6). Five of the 1000 scores will fall below that point. The upper bound of 
the 99% confidence interval will be the score in the 995th ordinal position. We obtain the latter 
ordinal position value by multiplying m = 1000 by .005, and subtracting the resultant value of 
5 from 1000 (i.e., m - m(.005) = 1000 - (1000)(.005) = 995). Five of the 1000 scores will fall 
above that point. Assume that of the total 1000 bootstrap standard deviations computed, the six 
lowest and six highest values are listed below. 

0, 0, .45, .45, .89, .89, ..., 4.27, 4.47, 4.47, 4.47, 4.67, 4.67 


Since the sixth score from the bottom is .89 and the sixth score from the top (which 
corresponds ordinally to the 995th score) is 4.27, the latter values represent the bounds that 
define the 99% confidence interval. Thus, we can conclude that we can be 99% confident (or 
the probability is .99) that the true value of the population standard deviation falls within the 
range .89 to 4.27. This result can be summarized as follows: $.89 \le \sigma \le 4.27$. 
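
The whole procedure of Example 12.4 can be sketched in a few lines. The following Python code is a minimal illustration, not a definitive implementation: the function name `bootstrap_sd_interval`, its parameters, and the fixed random seed are assumptions introduced here for reproducibility. Because the bootstrap samples are drawn at random, the bounds it returns will generally differ from the illustrative values .89 and 4.27 assumed in the text.

```python
import random
import statistics

def bootstrap_sd_interval(sample, m=1000, tail=0.005, seed=0):
    """Percentile bootstrap interval for the standard deviation.

    Draws m bootstrap samples of size n with replacement, computes the
    unbiased standard deviation estimate (denominator n - 1) for each,
    sorts the values, and reads off the two ordinal positions described
    in the text: m(tail) + 1 and m - m(tail).
    """
    rng = random.Random(seed)  # assumption: seeded only so the run repeats
    n = len(sample)
    sd_values = sorted(
        statistics.stdev(rng.choices(sample, k=n)) for _ in range(m)
    )
    lower_pos = round(m * tail) + 1   # sixth position when m = 1000
    upper_pos = m - round(m * tail)   # 995th position when m = 1000
    # Ordinal positions are 1-based; Python lists are 0-based.
    return lower_pos, upper_pos, sd_values[lower_pos - 1], sd_values[upper_pos - 1]

diamonds = [12, 7, 8, 2, 4]
lower_pos, upper_pos, lower, upper = bootstrap_sd_interval(diamonds)
print(lower_pos, upper_pos, round(lower, 2), round(upper, 2))
```

With only n = 5 scores the set of possible bootstrap standard deviations is small and discrete, which is one reason the text recommends larger samples in actual research.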

It should be emphasized that the use of the bootstrap in Example 12.4 for computing a 
confidence interval will be predicated on the fact that the researcher is unaware of any acceptable 
mathematical method for accurately determining the confidence interval. Efron and Tibshirani 
(1993, p. 52) note that while estimating a standard error rarely necessitates more than 200 boot- 
strap samples, the computation of a confidence interval generally requires a minimum of 1000 
bootstraps, and often considerably more depending upon the nature of the problem being 
evaluated (see Efron and Tibshirani (1993, Ch. 12-14) for a more detailed discussion). 

The bootstrap can be applied to numerous hypothesis testing situations. As an example, 
Efron and Tibshirani (1993) and Sprent (1998) discuss the use of the bootstrap in evaluating the 
null hypothesis that two samples are derived from the same population. One approach to the 
two-sample hypothesis testing situation is to use the bootstrap in a manner analogous to that 
employed for a permutation test. However, instead of employing sampling without replacement, 
as is the case with a permutation test (since for each arrangement, $n_1$ scores are placed in one 
group, and the remaining $N - n_1 = n_2$ scores by default constitute the other group), sampling 
with replacement is employed for the bootstrap. To illustrate, assume that we have two samples 
comprised of $n_1$ subjects in Group 1 and $n_2$ subjects in Group 2, with $n_1 + n_2 = N$. Using the 
total of N scores, employing sampling with replacement, we employ the computer to randomly 
select a large number of bootstrap samples, each sample being comprised of N scores. Within each 
bootstrap sample, the first $n_1$ scores are employed to represent the scores for Group 1 and the 
remaining $N - n_1 = n_2$ scores are employed to represent the scores for Group 2. For each 
bootstrap sample, a difference score between the two group means is computed. The empirical 
sampling distribution of the difference scores is employed to evaluate the null hypothesis. 
Specifically, it is determined how likely it is to have obtained the observed difference in the original 
data when it is considered within the framework of a large number of bootstrap sample difference 
scores. This is essentially what is done in a permutation test, and Efron and Tibshirani (1993, p. 
221) note that when a large number of bootstrap samples (e.g., 1000) are employed, the bootstrap 
and permutation test applied to the same set of data yield similar results. However, Efron and 
Tibshirani (1993, p. 220) make the general statement that although bootstrap tests are more widely 
applicable than permutation tests within the framework of hypothesis testing situations, they are not 
as accurate. As noted earlier in this section, the bootstrap is more accurate in estimating values such 
as standard errors and confidence intervals than it is in the area of hypothesis testing. 
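
The two-sample bootstrap test described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `bootstrap_two_sample_p`, the fixed seed, the two-tailed comparison of absolute differences, and the two sets of scores are all hypothetical and do not come from any example in the text.

```python
import random

def bootstrap_two_sample_p(group1, group2, m=1000, seed=0):
    """Bootstrap analogue of a two-sample permutation test.

    Pools the N scores, draws m bootstrap samples of size N with
    replacement, treats the first n1 scores of each as Group 1 and the
    remaining n2 as Group 2, and returns the observed mean difference
    together with the proportion of bootstrap differences at least as
    extreme (in absolute value) as the observed one.
    """
    rng = random.Random(seed)  # assumption: seeded only so the run repeats
    n1 = len(group1)
    pooled = list(group1) + list(group2)
    N = len(pooled)
    observed = sum(group1) / n1 - sum(group2) / len(group2)
    extreme = 0
    for _ in range(m):
        resample = rng.choices(pooled, k=N)  # sampling with replacement
        first, rest = resample[:n1], resample[n1:]
        diff = sum(first) / n1 - sum(rest) / (N - n1)
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / m

# Hypothetical scores, used only to exercise the function.
observed, p_value = bootstrap_two_sample_p([3, 5, 4, 6, 2], [8, 9, 7, 10, 6])
print(observed, p_value)
```

A permutation test would differ only in the resampling step: the pooled scores would be shuffled (sampling without replacement) rather than redrawn with replacement.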

In concluding this discussion, it should be emphasized that the bootstrap is a relatively new 
methodology. Consequently, at the present time, the method itself is the subject of considerable 
research which focuses on both its theoretical underpinnings and practical applications. Manly 
(1997) notes that although the bootstrap is based on a very simple concept, over the years the 
theory behind it has become increasingly complex. (Many of the applications of the bootstrap, 
as well as the theory behind them, are considerably more involved than what has been discussed 
in this section.) One could conjecture that if a researcher does not understand the theory or 
operations underlying a statistical procedure, one may become increasingly reluctant to use it, or, 
because of one's ignorance, use it inappropriately. Manly (1997) recommends that in the future, 
appropriate user-friendly applications of the bootstrap be integrated into standard statistical 
software. 

Test 12c: The jackknife  Another computer-intensive procedure that was developed (by 
Quenouille (1949)) prior to the bootstrap is the jackknife. Like the bootstrap, the jackknife is 
an alternative methodology available to the researcher that can be employed for point/interval 
estimation. Specifically, under certain conditions the jackknife can be employed to reduce the 
degree of bias associated with point estimation (i.e., increase accuracy in estimating a population 
parameter). Like the bootstrap, the jackknife might be considered as a viable alternative in situations 
where there is no clear analytic theory to obtain a measure of accuracy of an estimator. 
At this point a simple example will be presented to demonstrate how the jackknife can be 
employed to estimate the degree of bias associated with an estimator of a parameter. The data 
for Example 12.4 will be employed to demonstrate the use of the jackknife in estimating bias in 
the following situations: a) Employing $\bar{X}$, the sample mean (which is known to be an unbiased 
estimator of the population mean ($\mu$)), to estimate the value of the population mean; and 
b) Employing the sample variance ($s^2$) as computed with Equation I.4 (which is known to be a 
biased estimator of the population variance ($\sigma^2$)) to estimate the value of the population variance. 

Employing Equation I.1, we compute the sample mean for the data of Example 12.4 
(i.e., the n = 5 scores 12, 7, 8, 2, 4) to be $\bar{X} = 6.6$. Employing Equation I.4, the sample variance 
($s^2$), which is known to be a biased estimate of $\sigma^2$, is computed to be $s^2 = 11.84$ 
($s^2 = [277 - [(33)^2/5]]/5 = 11.84$). Employing Equation I.5, the estimated population variance 
($\tilde{s}^2$), which is known to be an unbiased estimate of $\sigma^2$, is computed to be $\tilde{s}^2 = 14.8$ 
($\tilde{s}^2 = [277 - [(33)^2/5]]/4 = 14.8$). 
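
These three values can be confirmed with Python's standard `statistics` module, as in the short sketch below. The variable names are illustrative assumptions; `pvariance` divides by n (as in Equation I.4) and `variance` divides by n - 1 (as in Equation I.5).

```python
import statistics

scores = [12, 7, 8, 2, 4]

mean = statistics.mean(scores)               # Equation I.1
biased_var = statistics.pvariance(scores)    # Equation I.4, denominator n
unbiased_var = statistics.variance(scores)   # Equation I.5, denominator n - 1

print(round(mean, 2), round(biased_var, 2), round(unbiased_var, 2))  # 6.6 11.84 14.8
```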

The methodology of the jackknife requires that n subsamples, to be designated jackknife 
samples, be derived from the original sample of n scores. Each of the jackknife samples will be 
comprised of (n — 1) scores. In each of the n jackknife samples one of the original n scores is 
omitted. From the original set of 5 scores 12, 7, 8, 2, 4, we can derive the following five 
jackknife samples, comprised of four scores per sample: Jackknife sample 1: 7, 8, 2, 4; 
Jackknife sample 2: 12, 8, 2, 4; Jackknife sample 3: 12, 7, 2, 4; Jackknife sample 4: 12, 7, 
8, 4; Jackknife sample 5: 12, 7, 8, 2. Note that in Jackknife sample 1 the first of the five 
scores listed is omitted, and in Jackknife sample 2 the second of the five scores listed is omitted, 
and so on. Thus, each of the samples is comprised of four of the original five scores. 

Employing Equations I.1 and I.4 we compute the mean ($\bar{X}$) and sample variance ($s^2$) for 
each of the five jackknife samples. These values are summarized below. 


$\bar{X}_1 = 5.25$    $\bar{X}_2 = 6.5$    $\bar{X}_3 = 6.25$    $\bar{X}_4 = 7.75$    $\bar{X}_5 = 7.25$ 
$s_1^2 = 5.6875$    $s_2^2 = 14.75$    $s_3^2 = 14.1875$    $s_4^2 = 8.1875$    $s_5^2 = 12.6875$ 
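
The five jackknife samples and their summary values can be generated mechanically. The following Python sketch (variable names are illustrative assumptions) omits one score at a time, in the order in which the scores are listed, and computes the mean and the sample variance with denominator n for each resulting sample.

```python
import statistics

scores = [12, 7, 8, 2, 4]

# Jackknife sample i omits the i-th score of the original sample.
jackknife_samples = [scores[:i] + scores[i + 1:] for i in range(len(scores))]

jackknife_means = [statistics.mean(s) for s in jackknife_samples]
jackknife_vars = [statistics.pvariance(s) for s in jackknife_samples]

print(jackknife_samples[0])  # [7, 8, 2, 4]
print(jackknife_means)       # [5.25, 6.5, 6.25, 7.75, 7.25]
print(jackknife_vars)        # [5.6875, 14.75, 14.1875, 8.1875, 12.6875]
```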


Sprent (1993, p. 285) notes that Equation 12.7 is the general equation for computing a 
jackknife estimate of a specific parameter $\theta$ (where $\theta$ is the lower case Greek letter theta). 
Equation 12.8 is the general equation for the jackknife estimate of the degree of bias associated 
with the estimate, which will be represented with value B. 


$$\theta_J = n\theta' - (n - 1)\bar{\theta}_j$$    (Equation 12.7) 


Where: $\theta_J$ represents the jackknife estimate of the parameter 
$\theta'$ represents the value of the statistic computed for the parameter from the original 
sample data comprised of n scores 
$\bar{\theta}_j$ represents the average value of the sample statistic computed for the n jackknife 
samples. (The same statistic employed to compute $\theta'$ should be employed to compute 
the estimate of the parameter for the n jackknife samples.) 


$$B = (n - 1)(\bar{\theta}_j - \theta')$$    (Equation 12.8) 


In employing the jackknife estimator of the population mean ($\mu$), the symbol $\theta$ will be used 
to represent the population mean. The following values will be employed in Equations 12.7 and 
12.8: a) n = 5; b) $\theta' = \bar{X} = 6.6$, since that is the mean value of the five scores 12, 7, 8, 2, 4; 
and c) $\bar{\theta}_j = 6.6$, since that is the mean of the five jackknife sample means (i.e., (5.25 + 6.5 + 6.25 + 7.75 
+ 7.25)/5 = 6.6). The appropriate values are substituted in Equations 12.7 and 12.8. 


$$\theta_J = 5(6.6) - (5 - 1)(6.6) = 6.6$$ 


$$B = (5 - 1)(6.6 - 6.6) = 0$$ 


The computed value $\theta_J = 6.6$ is the jackknife estimate of the population mean. The 
computed value B = 0 indicates that the degree of bias associated with the sample statistic $\bar{X}$ in 
estimating the population mean is zero. Since in the case of the population mean it is known that 
$\bar{X}$ provides an unbiased estimate of $\mu$, it is expected that the value zero will be computed for B. 
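
It may help to note how Equations 12.7 and 12.8 are linked. Writing $\theta_J$ for the jackknife estimate, $\theta'$ for the statistic computed from the original sample, and $\bar{\theta}_j$ for the average of the statistic over the n jackknife samples (symbol choices that are assumptions made here for readability), simple rearrangement shows that the jackknife estimate is the original statistic with the estimated bias removed:

```latex
\begin{align*}
\theta_J &= n\theta' - (n - 1)\bar{\theta}_j \\
         &= \theta' - (n - 1)\left(\bar{\theta}_j - \theta'\right) \\
         &= \theta' - B
\end{align*}
```

Consequently, whenever B = 0 the jackknife estimate equals the original statistic, as is the case for the sample mean (6.6 - 0 = 6.6).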

In employing the jackknife estimator of the population variance ($\sigma^2$), the symbol $\theta$ will be 
used to represent the population variance. The following values will be employed in Equations 
12.7 and 12.8: a) n = 5; b) $\theta' = s^2 = 11.84$, since that is the value computed with Equation 
I.4 for the sample variance for the five scores 12, 7, 8, 2, 4; and c) $\bar{\theta}_j = 11.1$, since that is the 
mean of the variances of the five jackknife samples (i.e., (5.6875 + 14.75 + 14.1875 + 8.1875 + 
12.6875)/5 = 11.1). The appropriate values are substituted in Equations 12.7 and 12.8. 


$$\theta_J = 5(11.84) - (5 - 1)(11.1) = 14.8$$ 


$$B = (5 - 1)(11.1 - 11.84) = -2.96$$ 


The computed value $\theta_J = 14.8$ is the jackknife estimate of the population variance. The 
computed value B = -2.96 indicates that the degree of bias associated with the sample statistic 
$s^2$ in estimating the population variance is -2.96 (the negative sign indicating underestimation 
of the population variance by 2.96 units through use of the sample statistic $s^2$). Note that the 
computed value $\theta_J = 14.8$ is 2.96 units above the value $s^2 = 11.84$, and that $\theta_J = 14.8$ is, in 
fact, equivalent to the unbiased estimate of the population variance $\tilde{s}^2 = 14.8$ computed with 
Equation I.5. In point of fact, suppose the value $\tilde{s}^2 = 14.8$ is employed in Equations 12.7 and 12.8 
to represent $\theta'$, and Equation I.5 is employed to compute the five jackknife variances (the 
average of the five jackknife $\tilde{s}^2$ values is $\bar{\theta}_j = 14.8$). In such a case, the jackknife estimate of 
the variance computed with Equation 12.7 remains unchanged, yielding the value $\theta_J = 14.8$. 
However, the resulting value computed with Equation 12.8 will equal zero, since, as is the case 
when $\bar{X}$ is employed to estimate $\mu$, $\tilde{s}^2 = 14.8$ is an unbiased estimate of the population variance. 
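
The computations for all three estimators discussed in this section can be reproduced with one general routine. The following Python sketch is a minimal illustration; the function name `jackknife` and its return convention are assumptions introduced here. It applies Equations 12.7 and 12.8 to any statistic supplied as a function.

```python
import statistics

def jackknife(sample, statistic):
    """Return (jackknife estimate, estimated bias) for a statistic.

    theta_prime is the statistic computed on the full sample, and
    theta_bar is its average over the n leave-one-out samples.
    """
    n = len(sample)
    theta_prime = statistic(sample)
    leave_one_out = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    theta_bar = sum(leave_one_out) / n
    estimate = n * theta_prime - (n - 1) * theta_bar   # Equation 12.7
    bias = (n - 1) * (theta_bar - theta_prime)         # Equation 12.8
    return estimate, bias

scores = [12, 7, 8, 2, 4]
results = {
    "mean": jackknife(scores, statistics.mean),
    "variance, denominator n": jackknife(scores, statistics.pvariance),
    "variance, denominator n - 1": jackknife(scores, statistics.variance),
}
for name, (estimate, bias) in results.items():
    # abs() of the rounded bias avoids printing a negative zero.
    print(name, round(estimate, 2), round(bias, 2) if abs(bias) > 1e-9 else 0.0)
```

For these data the routine yields an estimate of 6.6 with zero bias for the mean, 14.8 with a bias of -2.96 for the variance computed with denominator n, and 14.8 with zero bias (up to floating-point rounding) for the variance computed with denominator n - 1.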

Sprent (1993, 1998) and Efron and Tibshirani (1993, p. 148) note that the jackknife will be 
ineffective in reducing bias for some parameters. As an example, both of the aforementioned 
authors demonstrate the ineffectiveness of the jackknife as an estimator of the population median. 
These authors, as well as Hollander and Wolfe (1999) discuss a modified jackknife procedure 
referred to as the delete-d jackknife. In the latter modification of the standard jackknife, d 
scores are omitted from each of the jackknife samples, where d is some integer value greater than 
1. Under certain conditions the latter procedure may provide a more accurate estimate of a 
population parameter than use of the conventional jackknife where d = 1. The reader should take 
note of the fact that for a set of n scores, the number of jackknife samples will be limited in 
number and be a function of the value of n, while the number of bootstrap samples that can be 
employed for a set of n scores is unlimited. Within this context, Efron and Tibshirani (1993) note 
that since the jackknife employs less data than the bootstrap, it is less efficient, and in the final 
analysis, what the jackknife provides is an approximation of the results obtained with the boot- 
strap. Manly (1997) and Mooney and Duval (1993) (both of whom present detailed discussions 
of the jackknife and the bootstrap) describe analytical situations in which they state or demon- 
strate that the jackknife provides more accurate information than the bootstrap. 
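
The point that the number of jackknife samples is fixed by n (and, for the delete-d jackknife, by n and d) can be illustrated by enumerating the samples. The following Python sketch is an illustration only; the function name `delete_d_samples` is an assumption, and the sketch enumerates the samples without computing any estimate from them.

```python
from itertools import combinations
from math import comb

def delete_d_samples(sample, d):
    """All jackknife samples obtained by omitting d of the n scores."""
    n = len(sample)
    # combinations() selects by position, so tied scores are kept distinct.
    return [list(kept) for kept in combinations(sample, n - d)]

scores = [12, 7, 8, 2, 4]
delete_1 = delete_d_samples(scores, 1)
delete_2 = delete_d_samples(scores, 2)

print(len(delete_1), comb(5, 1))  # 5 5
print(len(delete_2), comb(5, 2))  # 10 10
```

For n scores there are exactly n conventional jackknife samples and n!/[d!(n - d)!] delete-d samples, whereas the number of bootstrap samples m is chosen by the researcher.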


Final comments on computer-intensive procedures Procedures such as the bootstrap and 
jackknife are commonly discussed in books that address the general subject of robust statistical 
procedures (e.g., Huber (1981), Sprent (1993, 1998), Staudte and Sheather (1990)) — the latter 
term referring to statistical procedures which are not overly dependent on critical assumptions 
regarding an underlying population distribution (the concept of robustness is discussed in Section 
VII of the t test for two independent samples). Conover (1999, p. 116) notes that the term 
robustness is most commonly applied to methods that are employed when the normality 
assumption underlying an inferential statistical test is violated. He points out that in spite of the 
fact that when sample sizes are reasonably large certain tests such as the single-sample t test and 
the t test for two independent samples are known to be robust with respect to violation of the 
normality assumption (i.e., the accuracy of the tabled critical alpha values for the test statistics 
are not compromised), if the underlying distribution is not normal, the power of such tests may 
still be appreciably reduced. Related to this is the fact that Staudte and Sheather (1990, p. 14) 
paint a bleak picture regarding the power of commonly employed goodness-of-fit tests for 
normality. Specifically, these authors suggest that unless a sample size is relatively large, 
goodness-of-fit tests for normality (such as the Kolmogorov-Smirnov goodness-of-fit test for 
a single sample or the chi-square goodness-of-fit test (Test 8)) will generally not result in 
rejection of the null hypothesis of normality, unless the fit with respect to normality is dramat- 
ically violated. Consequently, they conclude that most goodness-of-fit tests are ineffective 
mechanisms for providing confirmation for the normal distribution assumption that more often 
than not researchers assume characterizes an underlying population. Staudte and Sheather (1990) 
argue that as a result of the failure of goodness-of-fit tests to reject the normal distribution model, 
procedures based on the assumption of normality all too often are employed with data that are 
derived from nonnormal populations. In instances where the normality assumption is violated, 
Staudte and Sheather (1990) encourage a researcher to consider employing a robust statistical 
procedure (such as the bootstrap) to analyze the data. In accordance with this view, Sprent 
(1998) notes that the bootstrap will often yield a more accurate result for a nonnormal population 
than will analysis of the data with a statistical test which assumes normality. 

Another characteristic of data that is often discussed within the framework of robust 
statistical procedures is the subject of outliers alluded to earlier in this section (for a more comprehensive 
discussion of outliers, the reader should consult Section VII of the t test for two 
independent samples). Research has shown that a single outlier can substantially compromise 
the power of a parametric statistical test. (Staudte and Sheather (1990) provide an excellent 
example of this involving the single-sample t test.) Various sources suggest that when one or 
more outliers are present in a set of data, a computer-intensive procedure (such as the bootstrap 
or jackknife) may provide a researcher with more accurate information regarding the underlying 
population(s) than a parametric procedure. 


References 


Barnett, V. and Lewis, T. (1994). Outliers in statistical data (3rd ed.). Chichester: John Wiley 
and Sons. 


© 2000 by Chapman & Hall/CRC 


Bell, C. B. and Doksum, K. A. (1965). Some new distribution-free statistics. Annals of 
Mathematical Statistics, 36, 203-214. 

Conover, W. J. (1980). Practical nonparametric statistics (2nd ed.). New York: John Wiley 
and Sons. 

Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York: John Wiley 
and Sons. 

Conover, W. J. and Iman, R. L. (1981). Rank transformations as a bridge between parametric 
and nonparametric statistics. The American Statistician, 35, 124-129. 

Daniel, W. W. (1990). Applied nonparametric statistics (2nd ed.). Boston: PWS-Kent 
Publishing Company. 

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of 
Statistics, 7, 1-26. 

Efron, B. and Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman 
and Hall. 

Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver and Boyd. 

Good, P. (1994). Permutation tests: A practical guide to resampling methods for testing 
hypotheses. New York: Springer. 

Hartigan, J. A. (1969). Using subsample values as typical values. Journal of the American 
Statistical Association, 64, 1303-1317. 

Hollander, M. and Wolfe, D.A. (1999). Nonparametric statistical methods. New York: John 
Wiley and Sons. 

Huber, P. J. (1981). Robust Statistics. New York: John Wiley and Sons. 

Kolmogorov, A. N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giorn 
dell’Inst. Ital. degli. Att., 4, 89-91. 

Ludbrook, J. and Dudley, H. (1998). Why permutation tests are superior to the t and F tests 
in biomedical research. The American Statistician, 52, 127-132. 

Manly, B. F. J. (1997). Randomization, bootstrap and Monte Carlo methods in biology (2nd 
ed.). London: Chapman and Hall. 

Mann, H. and Whitney, D. (1947). On a test of whether one of two random variables is 
stochastically larger than the other. Annals of Mathematical Statistics, 18, 50-60. 

Marascuilo, L. A. and McSweeney, M. (1977). Nonparametric and distribution-free methods 
for the social sciences. Monterey, CA: Brooks/Cole Publishing Company. 

Maxwell, S. E. and Delaney, H. (1990). Designing experiments and analyzing data. 
Monterey, CA: Wadsworth Publishing Company. 

Mooney, C. Z. and Duval, D. (1993). Bootstrapping: A nonparametric approach to 
statistical inference. Newbury Park, CA: Sage Publications. 

Pitman, E. J. G. (1937a). Significance tests that may be applied to samples from any population. 
Journal of the Royal Statistical Society: Supplement, 4, 119-130. 

Pitman, E. J. G. (1937b). Significance tests that may be applied to samples from any population, 
II. The correlation coefficient test. Journal of the Royal Statistical Society: Supplement, 
4, 225-232. 

Pitman, E. J. G. (1938). Significance tests that may be applied to samples from any population, 
III. The analysis of variance test. Biometrika, 29, 322-335. 

Quenouille, M. H. (1949). Approximate tests of correlation in time series. Journal of the Royal 
Statistical Society, B, 11, 68-84. 

Sheskin, D. J. (1984). Statistical tests and experimental design: A guidebook. New York: 
Gardner Press. 

Siegel, S. and Castellan, N. J., Jr. (1988). Nonparametric statistics for the behavioral sciences 
(2nd ed.). New York: McGraw-Hill Book Company. 







Simon, J. L. (1969). Basic research methods in social science. New York: Random House. 

Smirnov, N. V. (1939). On the estimation of the discrepancy between empirical curves of 
distributions for two independent samples. Bulletin University of Moscow, 2, 3-14. 

Sprent, P. (1993). Applied nonparametric statistical methods (2nd ed.). London: Chapman 
and Hall. 

Sprent, P. (1998). Data driven statistical methods. London: Chapman and Hall. 

Staudte, R. G. and Sheather, S. J. (1990). Robust estimation and testing. New York: John 
Wiley and Sons. 

Terry, M. E. (1952). Some rank-order tests, which are most powerful against specific parametric 
alternatives. Annals of Mathematical Statistics, 23, 346-366. 

Tukey, J. W. (1958). Bias and confidence in not quite large samples (Abstract). Annals of 
Mathematical Statistics, 29, 614. 

Tukey, J. W. (1959). A quick, compact, two-sample test to Duckworth's specifications. 
Technometrics, 1, 31-48. 

Van der Waerden, B. L. (1952/1953). Order tests for the two-sample problem and their power. 
Proceedings Koninklijke Nederlandse Akademie van Wetenschappen (A), 55 (Indagationes 
Mathematicae, 14), 453-458, and 56 (Indagationes Mathematicae, 15), 303-316 
(corrections appear in Vol. 56, p. 80). 

Wald, A. and Wolfowitz, J. (1940). On a test whether two samples are from the same 
population. Annals of Mathematical Statistics, 11, 147-162. 

Wilcoxon, F. (1949). Some rapid approximate statistical procedures. Stamford, CT: Stam- 
ford Research Laboratories, American Cyanamid Corporation. 

Wilks, S. S. (1961). A combinatorial test for the problems of two samples from continuous 
distributions. In J. Neyman (Ed.), Proceedings of the Fourth Berkeley Symposium on 
Mathematical Statistics and Probability. Berkeley and Los Angeles: University of 
California Press, Vol. I, 707-717. 

Zimmerman, D. W. and Zumbo, B. D. (1993). The relative power of parametric and nonparametric 
statistical methods. In Keren, G. and Lewis, C. (Eds.), A handbook for data 
analysis in the behavioral sciences: Methodological issues. Hillsdale, NJ: Lawrence 
Erlbaum Associates, Publishers. 


Endnotes 


1. The test to be described in this chapter is also referred to as the Wilcoxon rank-sum test 
and the Mann-Whitney-Wilcoxon test. 


2. The reader should take note of the following with respect to the table of critical values for 
the Mann-Whitney U distribution: a) No critical values are recorded in the Mann- 
Whitney table for very small sample sizes, since a level of significance of .05 or less 
cannot be achieved for sample sizes below a specific minimum value; b) The critical values 
published in Mann-Whitney tables by various sources may not be identical. Such 
differences are trivial (usually one unit), and are the result of rounding off protocol; and c) 
The table for the alternative version of the Mann-Whitney U test (which was developed 
by Wilcoxon (1949)) contains critical values that are based on the sampling distribution of 
the sums of ranks, which differ from the tabled critical values contained in Table A11 
(which represents the sampling distribution of U values). 


3. Although for Example 12.1 we can also say that since $\Sigma R_1 < \Sigma R_2$ the data are consistent 
with the directional alternative hypothesis $H_1: \theta_1 < \theta_2$, the latter will not necessarily 
always be the case when $n_1 \ne n_2$. Since the relationship between the average of the ranks 
will always be applicable to both equal and unequal sample sizes, it will be employed in 
describing the hypothesized relationship between the ranks of the two groups. 


4. Some sources employ an alternative normal approximation equation which yields the same 
result as Equation 12.4. The alternative equation is noted below. 

$$z = \frac{\Sigma R_1 - \frac{n_1(n_1 + n_2 + 1)}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}$$

Note that the only difference between the above equation and Equation 12.4 is with respect 
to the numerator. If the value $\Sigma R_1 = 19$ from Example 12.1 is substituted in the above 
equation, it yields the value -8.5 for the numerator (which is the value for the numerator 
computed with Equation 12.4), and consequently the value z = -1.78. If $\Sigma R_2$ is employed 
in the numerator of the above equation in lieu of $\Sigma R_1$, the numerator of the equation 
assumes the form $\Sigma R_2 - [[n_2(n_1 + n_2 + 1)]/2]$. If $\Sigma R_2 = 36$ is substituted in the revised 
numerator, the value 8.5 is computed for the numerator, which results in the value z = 1.78. 
(Later on in this discussion it will be noted that the same conclusions regarding the null 
hypothesis are reached with the values z = -1.78 and z = 1.78.) The above equation is 
generally employed in sources that describe the version of the Mann-Whitney U test 
developed by Wilcoxon (1949). The latter version of the test only requires that the sum of 
the ranks be computed for each group, and does not require the computation of U values. 
As noted in Endnote 2, the table of critical values for the alternative version of the test is 
based on the sampling distribution of the sums of the ranks. 
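
The values z = -1.78 and z = 1.78 cited in this endnote can be checked numerically. The Python sketch below is an illustration under stated assumptions: the function name `rank_sum_z` is hypothetical, the denominator is assumed to be the standard untied form, the square root of n1 n2 (n1 + n2 + 1)/12, and the group sizes n1 = n2 = 5 are inferred from the numerator values quoted above rather than stated in this endnote.

```python
import math

def rank_sum_z(sum_ranks, n_own, n_other):
    """Normal approximation based on the sum of ranks for one group.

    Assumes no correction for ties or continuity.
    """
    expected = n_own * (n_own + n_other + 1) / 2
    standard_error = math.sqrt(n_own * n_other * (n_own + n_other + 1) / 12)
    return (sum_ranks - expected) / standard_error

z_group1 = rank_sum_z(19, 5, 5)  # numerator: 19 - 27.5 = -8.5
z_group2 = rank_sum_z(36, 5, 5)  # numerator: 36 - 27.5 = 8.5

print(round(z_group1, 2), round(z_group2, 2))  # -1.78 1.78
```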


5. A general discussion of the correction for continuity can be found under the Wilcoxon 
signed-ranks test. 


6. Some sources employ the term below for the denominator of Equation 12.6. It yields the 
identical result. 


7. A correction for continuity can be used by subtracting .5 from the value computed in the 
numerator of Equation 12.6. The continuity correction will reduce the absolute value 
of z. 


8. The rationale for discussing computer-intensive procedures in the Addendum of the 
Mann-Whitney U test is that the Mann-Whitney test (as well as many other rank-order 
procedures) can be conceptualized as an example of a randomization or permutation test, 
which is the first of the computer based procedures to be described in the Addendum. 


9. Another application of a computer-intensive procedure is Monte Carlo research which is 
discussed in Section VII of the single-sample runs test. 


10. The term bootstrap is derived from the saying that a person lifts oneself up by one's 
bootstraps. Within the framework of the statistical procedure, bootstrapping indicates that 
a single sample is used as a basis for generating multiple additional samples; in other 
words, one makes the most out of what little resources one has. Manly (1997) notes that 
the use of the term jackknife is based on the idea that a jackknife is a multipurpose tool 
which can be used for many tasks, in spite of the fact that for any single task it is seldom 
the best tool. 


11. The reader should take note of the fact that although Example 12.3 involves two indepen- 
dent samples, by using the basic methodology to be described in this section, randomization 
tests can be employed to evaluate virtually any type of experimental design. 


12. a) Suppose in the above example we have an unequal number of subjects in each group. 
Specifically, let us assume two subjects are randomly assigned to Group 1 and four subjects 
to Group 2. The total number of possible arrangements will be the combination of six 
things taken two at a time, which is equivalent to the combination of six things taken four 
at a time. This results in 15 different arrangements: $\binom{6}{2} = \binom{6}{4} = \frac{6!}{2!\,4!} = 15$; b) To 
illustrate how large the total number of arrangements can become, suppose we have a total 
of 40 subjects and randomly assign 15 subjects to Group 1 and 25 subjects to Group 2. The 
total number of possible arrangements is the combination of 40 things taken 15 at a time, 
which is equivalent to the combination of 40 things taken 25 at a time. The total number 
of possible arrangements will be $\binom{40}{15} = \binom{40}{25} = \frac{40!}{15!\,25!} = 40{,}225{,}345{,}056$. Obviously, 
without the aid of a computer it will be impossible to evaluate such a large number of 
arrangements. 
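
The counts in this endnote are binomial coefficients, so they can be verified directly. The sketch below uses Python's standard `math.comb`; the helper name `n_arrangements` is illustrative only.

```python
import math

def n_arrangements(n_total, n_group1):
    """Number of distinct ways to assign n_group1 of n_total subjects to Group 1."""
    return math.comb(n_total, n_group1)

# Six subjects: 2 versus 4, and the equal-groups case of 3 versus 3
print(n_arrangements(6, 2), n_arrangements(6, 4))
print(n_arrangements(6, 3))

# Forty subjects: choosing 15 for Group 1 is the same as choosing 25 for Group 2
print(n_arrangements(40, 15) == n_arrangements(40, 25))
print(f"{n_arrangements(40, 15):,}")
```

Because the count depends only on how many subjects go into one group, choosing the members of Group 1 automatically fixes the members of Group 2.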


13. The reader should take note of the following: a) If the 20 corresponding arrangements for 
Group 2 are listed in Table 12.3, the same 20 arrangements that are listed for Group 1 will 
appear in the table, but in different rows. To illustrate, the first arrangement in Table 12.3 
for Group 1 is comprised of the scores 7, 10, 11. The corresponding Group 2 arrangement 
is 15, 18, 21 (which are the three remaining scores in the sample). In the last row of Table 
12.3 the scores 15, 18, 21 are listed. The corresponding arrangement for Group 2 will be 
the remaining scores, which are 7, 10, 11. If we continue this process for the remaining 18 
Group 1 arrangements, the final distribution of arrangements for Group 2 will be comprised 
of the same 20 arrangements obtained for Group 1; b) When $n_1 \ne n_2$, the distribution of 
the arrangements of the scores in the two groups will not be identical, since all the 
arrangements in Group 1 will always have $n_1$ scores and all the arrangements in Group 2 
will always have $n_2$ scores, and $n_1 \ne n_2$. Nevertheless, computation of the appropriate 
sampling distribution for the data only requires that a distribution be computed which is 
based on the arrangements for one of the two groups. Employing the distribution for the 
other group will yield the identical result for the analysis. 
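
Part a) of this endnote can be illustrated by enumerating the arrangements. The following is a minimal sketch that uses only the six scores quoted above; the order in which `itertools.combinations` lists the arrangements is not claimed to match the row order of Table 12.3.

```python
from itertools import combinations

scores = (7, 10, 11, 15, 18, 21)  # the six scores cited from Table 12.3

# Every way of choosing three of the six scores for Group 1
group1_arrangements = list(combinations(scores, 3))

# Each Group 2 arrangement is whatever remains once Group 1 is chosen
group2_arrangements = [
    tuple(s for s in scores if s not in g1) for g1 in group1_arrangements
]

print(len(group1_arrangements))
print(group1_arrangements[0], group2_arrangements[0])
# Same 20 arrangements for both groups, only paired with different rows
print(set(group1_arrangements) == set(group2_arrangements))
```

When the group sizes are equal, the set of complements is the same set of arrangements, which is why a distribution built from either group gives the identical result.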


14. Although the result obtained with the Mann-Whitney U test is equivalent to the result that 
will be obtained with a randomization/permutation test conducted on the rank-orders, only 
the version of the test that was developed by Wilcoxon (1949) directly evaluates the 
permutations of the ranks. Marascuilo and McSweeney (1977, pp. 270-272) and Sprent 
(1998, pp. 85-86) note that the version of the test described by Mann and Whitney (1947) 
actually employs a statistical model which evaluates the number of inversions in the data. 
An inversion is defined as follows: Assume that we begin with the assumption that all the 
scores in one group (designated Group 1) are higher than all the scores in the other group 
(designated Group 2). If we compare all the scores in Group 1 with all the scores in Group 
2, an inversion is any instance in which a score in Group 1 is not higher than a score in 
Group 2. It turns out an inversion based model yields a result that is equivalent to that 
obtained when the permutations of the ranks are evaluated (as is done in Wilcoxon's (1949) 
version of the Mann-Whitney U test). Employing the data from Example 12.1, consider 
Table 12.6 where the scores in Group 1 are arranged ordinally in the top row, and the scores 
in Group 2 are arranged ordinally in the left column. 


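Table 12.6 Inversion Model of Mann-Whitney U test 


                         Group 1 Scores 
Group 2 Scores      0      0      1      2     11 
       4           -4     -4     -3     -2      7 
       5           -5     -5     -4     -3      6 
       8           -8     -8     -7     -6      3 
      11          -11    -11    -10     -9      0 
      11          -11    -11    -10     -9      0 

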
Each of the scores in the cells of Table 12.6 is a difference score that is the result of 
subtracting the score in the row in which a cell appears (i.e., the corresponding Group 2 score) from 
the score at the top of the column in which the cell appears (i.e., the corresponding Group 
1 score). Note that any negative difference score recorded in the table meets the criterion 
established for an inversion. Ties result in a zero value for a cell. The total number of 
inversions in the data is based on the number of negative difference scores and the number 
of ties. One point is allocated for each negative difference score, and one-half a point for 
each zero/tie (since the latter is viewed as a less extreme inversion than one associated with 
a negative difference score). Employing the aforementioned protocol, the number of 
inversions in Table 12.6 is the 20 negative difference scores plus the two zeros (one-half point each), which 
sums to 21 inversions. Note that the latter value corresponds to the value computed for 
$U_1$ with Equation 12.1. The value $U_2 = 4$ inversions will be obtained if Table 12.6 is 
reconstructed so that the score in each column (i.e., Group 1 score) is subtracted from the 
score in the corresponding row (i.e., Group 2 score). 
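
The inversion count described in this endnote can be sketched as follows, using the scores from Example 12.1. The function name `count_inversions` is illustrative; the scoring rule (one point per negative difference, one-half point per tie) is the one stated above.

```python
def count_inversions(column_scores, row_scores):
    """Count inversions over all (column score - row score) differences.

    One point per negative difference, one-half point per tie.
    """
    total = 0.0
    for row in row_scores:
        for col in column_scores:
            diff = col - row
            if diff < 0:
                total += 1.0
            elif diff == 0:
                total += 0.5
    return total

group1 = [0, 0, 1, 2, 11]    # Group 1 scores, arranged ordinally
group2 = [4, 5, 8, 11, 11]   # Group 2 scores, arranged ordinally

print(count_inversions(group1, group2))  # 21.0 (Group 1 in columns)
print(count_inversions(group2, group1))  # 4.0  (roles reversed)
```

The two counts sum to $n_1 n_2 = 25$, since every pair of scores is either an inversion in one direction, an inversion in the other, or a tie split evenly between them.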


15. Although in this example the identical probability is obtained for the highest sum of scores 
and highest sum of ranks, this will not always be the case. 


16. Efron and Tibshirani (1993, p. 394) note that the bootstrap differs from more conventional 
simulation procedures (discussed briefly in Section VII of the single-sample runs test), in 
that in conventional simulation, data are generated through use of a theoretical model (such 
as sampling from a theoretical population such as a normal distribution for which the mean 
and standard deviation have been specified). In the bootstrap the simulation is data-based. 
Specifically, multiple samples are drawn from a sample of data that is derived in an exper- 
iment. One problem with the bootstrap is that since it involves drawing random subsamples 
from a set of data, two or more researchers conducting a bootstrap may not reach the same 
conclusions due to differences in the random subsamples generated. 
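
The data-based nature of the bootstrap, and its dependence on the random draws, can be illustrated with a minimal sketch. It assumes the most common form of the bootstrap (resamples of the original sample size drawn with replacement) and uses the Group 1 ratings of Example 12.1 purely as illustrative data; `bootstrap_means` and the seed values are hypothetical choices, not part of the Handbook.

```python
import random

def bootstrap_means(sample, n_resamples, seed):
    """Draw resamples with replacement from the observed sample; return their means."""
    rng = random.Random(seed)  # separate generator so each run is reproducible
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=len(sample))
        means.append(sum(resample) / len(resample))
    return means

observed = [11, 1, 0, 2, 0]  # illustrative data only

# Two researchers using different random draws on the same data
run_a = bootstrap_means(observed, 1000, seed=1)
run_b = bootstrap_means(observed, 1000, seed=2)
print(sum(run_a) / len(run_a), sum(run_b) / len(run_b))
```

Fixing and reporting the seed makes a given bootstrap analysis repeatable, although it does not remove the sampling variability between independently run analyses.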


17. Sprent (1998) notes that when the bootstrap is employed correctly in situations where the 
normality assumption is not violated, it generally yields conclusions that will be consistent 
with those derived from the use of conventional parametric and nonparametric tests, as well 
as permutation tests. 


18. The exponential distribution is a continuous probability distribution which is often useful 
in investigating reliability theory and stochastic processes. Manly (1997) recommends that 
prior to employing the bootstrap for inferential purposes, it is essential to evaluate its per- 
formance with small samples derived from various theoretical probability distributions 
(such as the exponential distribution). Monte Carlo studies (i.e., computer simulations 
involving the derivation of samples from theoretical distributions for which the values of 
the relevant parameters have been specified) can be employed to evaluate the reliability of 
the bootstrap. 


Test 13 


The Kolmogorov-Smirnov Test for 


Two Independent Samples 
(Nonparametric Test Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Do two independent samples represent two different popu- 
lations? 


Relevant background information on test The Kolmogorov-Smirnov test for two inde- 
pendent samples was developed by Smirnov (1939). Daniel (1980) notes that because of the 
similarity between Smirnov's test and a goodness-of-fit test developed by Kolmogorov (1933) 
(the Kolmogorov-Smirnov goodness-of-fit test for a single sample (Test 7)), the test to be 
discussed in this chapter is often referred to as the Kolmogorov-Smirnov test for two inde- 
pendent samples (although other sources (Conover (1980, 1999)) simply refer to it as the 
Smirnov test). 

Daniel (1990), Marascuilo and McSweeney (1977), and Siegel and Castellan (1988) note 
that when a nondirectional/two-tailed alternative hypothesis is evaluated, the Kolmogorov- 
Smirnov test for two independent samples is sensitive to any kind of distributional difference 
(i.e., a difference with respect to location/central tendency, dispersion/variability, skewness, and 
kurtosis). When a directional/one-tailed alternative hypothesis is evaluated, the test evaluates the 
relative magnitude of the scores in the two distributions. 

As is the case with the Kolmogorov-Smirnov goodness-of-fit test for a single sample 
discussed earlier in the book, computation of the test statistic for the Kolmogorov-Smirnov test 
for two independent samples involves the comparison of two cumulative frequency distri- 
butions. Whereas the Kolmogorov-Smirnov goodness-of-fit test for a single sample compares 
the cumulative frequency distribution of a single sample with a hypothesized theoretical or 
empirical cumulative frequency distribution, the Kolmogorov-Smirnov test for two indepen- 
dent samples compares the cumulative frequency distributions of two independent samples. If, 
in fact, the two samples are derived from the same population, the two cumulative frequency 
distributions would be expected to be identical or reasonably close to one another. The test 
protocol for the Kolmogorov-Smirnov test for two independent samples is based on the 
principle that if there is a significant difference at any point along the two cumulative frequency 
distributions, the researcher can conclude there is a high likelihood the samples are derived from 
different populations. 

The Kolmogorov-Smirnov test for two independent samples is categorized as a test of 
ordinal data because it requires that cumulative frequency distributions be constructed (which 
requires that within each distribution scores be arranged in order of magnitude). Further 
clarification of the defining characteristics of a cumulative frequency distribution can be found 
in the Introduction, and in Section I of the Kolmogorov-Smirnov goodness-of-fit test for a 
single sample. Since the Kolmogorov-Smirnov test for two independent samples represents 
a nonparametric alternative to the t test for two independent samples (Test 11), the most 
common situation in which a researcher might elect to employ the Kolmogorov-Smirnov test 
to evaluate a hypothesis about two independent samples (where the dependent variable represents 
interval/ratio measurement) is when there is reason to believe that the normality and/or 
homogeneity of variance assumption of the t test has been saliently violated. The 
Kolmogorov-Smirnov test for two independent samples is based on the following 
assumptions: a) All of the observations in the two samples are randomly selected and 
independent of one another; and b) The scale of measurement is at least ordinal. 


II. Example 


Example 13.1 is identical to Examples 11.1/12.1 (which are evaluated with the t test for two 
independent samples and the Mann-Whitney U test (Test 12)). 


Example 13.1 In order to assess the efficacy of a new antidepressant drug, ten clinically 
depressed patients are randomly assigned to one of two groups. Five patients are assigned to 
Group 1, which is administered the antidepressant drug for a period of six months. The other 
five patients are assigned to Group 2, which is administered a placebo during the same six-month 
period. Assume that prior to introducing the experimental treatments, the experimenter con- 
firmed that the level of depression in the two groups was equal. After six months elapse all ten 
subjects are rated by a psychiatrist (who is blind with respect to a subject's experimental 
condition) on their level of depression. The psychiatrist's depression ratings for the five subjects 
in each group follow (the higher the rating, the more depressed a subject): Group 1: 11, 1, 0, 
2, 0; Group 2: 11, 11, 5, 8, 4. Do the data indicate that the antidepressant drug is effective? 


III. Null versus Alternative Hypotheses 


Prior to reading the null and alternative hypotheses to be presented in this section, the reader 
should take note of the following: a) The protocol for the Kolmogorov-Smirnov test for two 
independent samples requires that a cumulative probability distribution be constructed for each 
of the samples. The test statistic is defined by the point that represents the greatest vertical 
distance at any point between the two cumulative probability distributions; and b) Within the 
framework of the null and alternative hypotheses, the notation $F_j(X)$ represents the population 
distribution from which the $j^{th}$ sample/group is derived. $F_j(X)$ can also be conceptualized as 
representing the cumulative probability distribution for the population from which the $j^{th}$ 
sample/group is derived. 


Null hypothesis          $H_0$: $F_1(X) = F_2(X)$ for all values of $X$ 


(The distribution of data in the population that Sample 1 is derived from is consistent with the 
distribution of data in the population that Sample 2 is derived from. Another way of stating 
the null hypothesis is as follows: At no point is the greatest vertical distance between the cumu- 
lative probability distribution for Sample 1 (which is assumed to be the best estimate of the 
cumulative probability distribution of the population from which Sample 1 is derived) and the 
cumulative probability distribution for Sample 2 (which is assumed to be the best estimate of the 
cumulative probability distribution of the population from which Sample 2 is derived) larger than 
what would be expected by chance, if the two samples are derived from the same population 
distribution.) 


Alternative hypothesis   $H_1$: $F_1(X) \ne F_2(X)$ for at least one value of $X$ 


(The distribution of data in the population that Sample 1 is derived from is not consistent with 
the distribution of data in the population that Sample 2 is derived from. Another way of stating 
this alternative hypothesis is as follows: There is at least one point where the greatest vertical 
distance between the cumulative probability distribution for Sample 1 (which is assumed to be 
the best estimate of the cumulative probability distribution of the population from which Sample 
1 is derived) and the cumulative probability distribution for Sample 2 (which is assumed to be 
the best estimate of the cumulative probability distribution of the population from which Sample 
2 is derived) is larger than what would be expected by chance, if the two samples are derived 
from the same population distribution. At the point of maximum deviation separating the two 
cumulative probability distributions, the cumulative probability for Sample 1 is either 
significantly greater or less than the cumulative probability for Sample 2. This is a 
nondirectional alternative hypothesis and it is evaluated with a two-tailed test.) 


Or 
$H_1$: $F_1(X) > F_2(X)$ for at least one value of $X$ 


(The distribution of data in the population that Sample 1 is derived from is not consistent with 
the distribution of data in the population that Sample 2 is derived from. Another way of stating 
this alternative hypothesis is as follows: There is at least one point where the greatest vertical 
distance between the cumulative probability distribution for Sample 1 (which is assumed to be 
the best estimate of the cumulative probability distribution of the population from which Sample 
1 is derived) and the cumulative probability distribution for Sample 2 (which is assumed to be 
the best estimate of the cumulative probability distribution of the population from which Sample 
2 is derived) is larger than what would be expected by chance, if the two samples are derived 
from the same population distribution. At the point of maximum deviation separating the two 
cumulative probability distributions, the cumulative probability for Sample 1 is significantly 
greater than the cumulative probability for Sample 2. This is a directional alternative 
hypothesis and it is evaluated with a one-tailed test.) 


Or 
$H_1$: $F_1(X) < F_2(X)$ for at least one value of $X$ 


(The distribution of data in the population that Sample 1 is derived from is not consistent with 
the distribution of data in the population that Sample 2 is derived from. Another way of stating 
this alternative hypothesis is as follows: There is at least one point where the greatest vertical 
distance between the cumulative probability distribution for Sample 1 (which is assumed to be 
the best estimate of the cumulative probability distribution of the population from which Sample 
1 is derived) and the cumulative probability distribution for Sample 2 (which is assumed to be 
the best estimate of the cumulative probability distribution of the population from which Sample 
2 is derived) is larger than what would be expected by chance, if the two samples are derived 
from the same population distribution. At the point of maximum deviation separating the two 
cumulative probability distributions, the cumulative probability for Sample 1 is significantly less 
than the cumulative probability for Sample 2. This is a directional alternative hypothesis and 
it is evaluated with a one-tailed test.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 


As noted in Sections I and III, the test protocol for the Kolmogorov-Smirnov test for two 
independent samples contrasts the two sample cumulative probability distributions with one 
another. Table 13.1 summarizes the steps that are involved in the analysis. There are a total of 
$n = 10$ scores, with $n_1 = 5$ scores in Group 1 and $n_2 = 5$ scores in Group 2. 


Table 13.1 Calculation of Test Statistic for Kolmogorov-Smirnov Test 
for Two Independent Samples for Example 13.1 


  A          B              C          D              E 

 $X_1$     $S_1(X)$       $X_2$      $S_2(X)$       $S_1(X) - S_2(X)$ 
 0, 0      2/5 = .40       -         0              .40 - 0 = .40 
 1         3/5 = .60       -         0              .60 - 0 = .60 
 2         4/5 = .80       -         0              .80 - 0 = .80 
 -         4/5 = .80       4         1/5 = .20      .80 - .20 = .60 
 -         4/5 = .80       5         2/5 = .40      .80 - .40 = .40 
 -         4/5 = .80       8         3/5 = .60      .80 - .60 = .20 
 11        5/5 = 1.00      11, 11    5/5 = 1.00     1.00 - 1.00 = .00 

The values represented in the columns of Table 13.1 are summarized below. 

The values of the psychiatrist's depression ratings for the subjects in Group 1 are recorded 
in Column A. Note that there are five scores recorded in Column A, and that if the same score 
is assigned to more than one subject in Group 1, each of the scores of that value is recorded in 
the same row in Column A. 

Each value in Column B represents the cumulative proportion associated with the value of 
the X score recorded in Column A. The notation $S_1(X)$ is commonly employed to represent the 
cumulative proportions for Group/Sample 1 recorded in Column B. The value in Column B for 
any row is obtained as follows: The Group 1 cumulative frequency for the score in that row 
(i.e., the frequency of occurrence of all scores in Group 1 equal to or less than the score in that 
row) is divided by the total number of scores in Group 1 ($n_1 = 5$). To illustrate, in the case of 
Row 1, the score 0 is recorded twice in Column A. Thus, the cumulative frequency is equal to 
2, since there are 2 scores in Group 1 that are equal to 0 (a depression rating score cannot be less 
than 0). Thus, the cumulative frequency 2 is divided by $n_1 = 5$, yielding 2/5 = .40. The value 
.40 in Column B represents the cumulative proportion in Group 1 associated with a score of 0. 
It means that the proportion of scores in Group 1 that is equal to 0 is .40. The proportion of 
scores in Group 1 that is larger than 0 is .60 (since 1 - .40 = .60). In the case of Row 2, the score 
1 is recorded in Column A. The cumulative frequency is equal to 3, since there are 3 scores in 
Group 1 that are equal to or less than 1 (2 scores of 0 and a score of 1). Thus, the cumulative 
frequency 3 is divided by $n_1 = 5$, yielding 3/5 = .60. The value .60 in Column B represents the 
cumulative proportion in Group 1 associated with a score of 1. It means that the proportion of 
scores in Group 1 that is equal to or less than 1 is .60. The proportion of scores in Group 1 that 
is larger than 1 is .40 (since 1 - .60 = .40). In the case of Row 3, the score 2 is recorded in 
Column A. The cumulative frequency is equal to 4, since there are 4 scores in Group 1 that are 
equal to or less than 2 (two scores of 0, a score of 1, and a score of 2). Thus, the cumulative 
frequency 4 is divided by $n_1 = 5$, yielding 4/5 = .80. The value .80 in Column B represents the 
cumulative proportion in Group 1 associated with a score of 2. It means that the proportion of 
scores in Group 1 that is equal to or less than 2 is .80. The proportion of scores in Group 1 that 
is larger than 2 is .20 (since 1 - .80 = .20). Note that the value of the cumulative proportion in 
Column B remains .80 in Rows 4, 5, and 6, since until a new score is recorded in Column A, the 
cumulative proportion recorded in Column B will remain the same. In the case of Row 7, the 
score 11 is recorded in Column A. The cumulative frequency is equal to 5, since there are 5 
scores in Group 1 that are equal to or less than 11 (i.e., all of the scores in Group 1 are equal to 
or less than 11). Thus, the cumulative frequency 5 is divided by $n_1 = 5$, yielding 5/5 = 1. The 
value 1 in Column B represents the cumulative proportion in Group 1 associated with a score 
of 11. It means that the proportion of scores in Group 1 that is equal to or less than 11 is 1. The 
proportion of scores in Group 1 that is larger than 11 is 0 (since 1 - 1 = 0). 
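
The Column B calculation can be sketched in a few lines. This is a minimal illustration using the Group 1 ratings of Example 13.1; the function name `cumulative_proportion` is illustrative, and exact fractions are used so that values such as 2/5 are not affected by floating-point rounding.

```python
from fractions import Fraction

def cumulative_proportion(sample, score):
    """Proportion of scores in the sample that are equal to or less than `score`."""
    cumulative_frequency = sum(1 for x in sample if x <= score)
    return Fraction(cumulative_frequency, len(sample))

group1 = [11, 1, 0, 2, 0]

# One line per distinct Group 1 score: score, S1(X) as a fraction and a decimal,
# and the proportion of scores larger than that score
for score in sorted(set(group1)):
    s1 = cumulative_proportion(group1, score)
    print(score, s1, float(s1), float(1 - s1))
```

The same function applied to a score that does not occur in Group 1 (for example 4, 5, or 8) returns 4/5, which is why Column B stays at .80 in Rows 4 through 6.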

The values of the psychiatrist's depression ratings for the subjects in Group 2 are recorded 
in Column C. Note that there are five scores recorded in Column C, and if the same score is 
assigned to more than one subject in Group 2, each of the scores of that value is recorded in the 
same row in Column C. 

Each value in Column D represents the cumulative proportion associated with the value 
of the X score recorded in Column C. The notation $S_2(X)$ is commonly employed to represent 
the cumulative proportions for Group/Sample 2 recorded in Column D. The value in Column 
D for any row is obtained as follows: The Group 2 cumulative frequency for the score in that 
row (i.e., the frequency of occurrence of all scores in Group 2 equal to or less than the score in 
that row) is divided by the total number of scores in Group 2 ($n_2 = 5$). To illustrate, in the case 
of Rows 1, 2, and 3, no score is recorded in Column C. Thus, the cumulative frequency for 
each of those rows is equal to 0, since up to that point in the analysis there are no scores recorded 
for Group 2. Consequently, for each of the first three rows, the cumulative frequency 0 is 
divided by $n_2 = 5$, yielding 0/5 = 0. In each of the first three rows, the value 0 in Column D 
represents the cumulative proportion for Group 2 up to that point in the analysis. For each of 
those rows, the proportion of scores in Group 2 that remain to be analyzed is 1 (since 1 - 0 = 1). 
In the case of Row 4, the score 4 is recorded in Column C. The cumulative frequency is equal 
to 1, since there is 1 score in Group 2 that is equal to or less than 4 (i.e., the score 4 in that row). 
Thus, the cumulative frequency 1 is divided by $n_2 = 5$, yielding 1/5 = .20. The value .20 in 
Column D represents the cumulative proportion in Group 2 associated with a score of 4. It 
means that the proportion of scores in Group 2 that is equal to or less than 4 is .20. The 
proportion of scores in Group 2 that is larger than 4 is .80 (since 1 - .20 = .80). In the case of 
Row 5, the score 5 is recorded in Column C. The cumulative frequency is equal to 2, since there 
are 2 scores in Group 2 that are equal to or less than 5 (the scores of 4 and 5). Thus, the 
cumulative frequency 2 is divided by $n_2 = 5$, yielding 2/5 = .40. The value .40 in Column D 
represents the cumulative proportion in Group 2 associated with a score of 5. It means that the 
proportion of scores in Group 2 that is equal to or less than 5 is .40. The proportion of scores in 
Group 2 that is larger than 5 is .60 (since 1 - .40 = .60). In the case of Row 6, the score 8 is 
recorded in Column C. The cumulative frequency is equal to 3, since there are 3 scores in 
Group 2 that are equal to or less than 8 (the scores of 4, 5, and 8). Thus, the cumulative 
frequency 3 is divided by $n_2 = 5$, yielding 3/5 = .60. The value .60 in Column D represents the 
cumulative proportion in Group 2 associated with a score of 8. It means that the proportion of 
scores in Group 2 that is equal to or less than 8 is .60. The proportion of scores in Group 2 that 
is larger than 8 is .40 (since 1 - .60 = .40). In the case of Row 7, the score 11 is recorded twice 
in Column C. The cumulative frequency is equal to 5, since there are 5 scores in Group 2 that 
are equal to or less than 11 (i.e., all of the scores in Group 2 are equal to or less than 11). Thus, 
the cumulative frequency 5 is divided by $n_2 = 5$, yielding 5/5 = 1. The value 1 in Column D 
represents the cumulative proportion in Group 2 associated with a score of 11. It means that the 
proportion of scores in Group 2 that is equal to or less than 11 is 1. The proportion of scores in 
Group 2 that is larger than 11 is 0 (since 1 - 1 = 0). 

The values in Column E are difference scores between the cumulative proportions recorded 
in Column B for Group 1 and Column D for Group 2. Thus, for Row 1 the entry in Column E is .40, 
which represents the Column B cumulative proportion of .40 for Group 1, minus 0, which 
represents the Column D cumulative proportion for Group 2. For Row 2 the entry in Column 
E is .60, which represents the Column B cumulative proportion of .60 for Group 1, minus 0, 
which represents the Column D cumulative proportion for Group 2. The same procedure is 
employed with the remaining five rows in the table. 

As noted in Section III, the test statistic for the Kolmogorov-Smirnov test for two 
independent samples is defined by the greatest vertical distance at any point between the two 
cumulative probability distributions. The largest absolute value obtained in Column E will 
represent the latter value. The notation M will be employed for the test statistic. In Table 13.1 
the largest absolute value is .80 (which is recorded in Row 3). Therefore, M = .80.¹ 
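
The computation of the test statistic can be sketched as follows. This is a minimal illustration, not a definitive implementation: it evaluates both cumulative proportions at every distinct score in the pooled data (the rows of Table 13.1) and keeps the largest absolute difference. The function name `ks_two_sample` is illustrative only.

```python
from fractions import Fraction

def ks_two_sample(sample1, sample2):
    """Return (M, signed difference S1 - S2 at that point, score at which it occurs)."""
    def cum_prop(sample, score):
        # Proportion of scores equal to or less than `score`
        return Fraction(sum(1 for x in sample if x <= score), len(sample))

    best = (Fraction(0), Fraction(0), None)
    for score in sorted(set(sample1) | set(sample2)):
        diff = cum_prop(sample1, score) - cum_prop(sample2, score)  # Column E
        if abs(diff) > best[0]:
            best = (abs(diff), diff, score)
    return best

group1 = [11, 1, 0, 2, 0]
group2 = [11, 11, 5, 8, 4]
m, signed_diff, at_score = ks_two_sample(group1, group2)
print(float(m), float(signed_diff), at_score)  # 0.8 0.8 2
```

The signed difference is returned alongside M because the directional alternative hypotheses depend on which sample has the larger cumulative proportion at the point of greatest distance.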


V. Interpretation of the Test Results 


The test statistic for the Kolmogorov-Smirnov test for two independent samples is evaluated 
with Table A23 (Table of Critical Values for the Kolmogorov-Smirnov test for two independent 
samples) in the Appendix. If, at any point along the two cumulative probability 
distributions, the greatest distance (i.e., the value of M) is equal to or greater than the tabled 
critical value recorded in Table A23, the null hypothesis is rejected. The critical values in Table 
A23 are listed in reference to the values of $n_1$ and $n_2$. For $n_1 = 5$ and $n_2 = 5$, the tabled 
critical two-tailed .05 and .01 values are $M_{.05} = .800$ and $M_{.01} = .800$, and the tabled critical 
one-tailed .05 and .01 values are $M_{.05} = .600$ and $M_{.01} = .800$.² 

The following guidelines are employed in evaluating the null hypothesis for the 
Kolmogorov-Smirnov test for two independent samples. 

a) If the nondirectional alternative hypothesis $H_1$: $F_1(X) \ne F_2(X)$ is employed, the null 
hypothesis can be rejected if the computed value of the test statistic is equal to or greater than the 
tabled critical two-tailed M value at the prespecified level of significance. 

b) If the directional alternative hypothesis $H_1$: $F_1(X) > F_2(X)$ is employed, the null 
hypothesis can be rejected if the computed value of the test statistic is equal to or greater than the 
tabled critical one-tailed M value at the prespecified level of significance. Additionally, the 
difference between the two cumulative probability distributions must be such that in reference 
to the point that represents the test statistic, the cumulative probability for Sample 1 must be 
larger than the cumulative probability for Sample 2. 

c) If the directional alternative hypothesis $H_1$: $F_1(X) < F_2(X)$ is employed, the null 
hypothesis can be rejected if the computed value of the test statistic is equal to or greater than the 
tabled critical one-tailed M value at the prespecified level of significance. Additionally, the 
difference between the two cumulative probability distributions must be such that in reference 
to the point that represents the test statistic, the cumulative probability for Sample 1 must be less 
than the cumulative probability for Sample 2. 

The above guidelines will now be employed in reference to the computed test statistic 
M = .80. 

a) If the nondirectional alternative hypothesis $H_1$: $F_1(X) \ne F_2(X)$ is employed, the null 
hypothesis can be rejected at both the .05 and .01 levels, since M = .80 is equal to the tabled 
critical two-tailed values $M_{.05} = .800$ and $M_{.01} = .800$. 

b) If the directional alternative hypothesis $H_1$: $F_1(X) > F_2(X)$ is employed, the null 
hypothesis can be rejected at both the .05 and .01 levels, since M = .80 is greater than or equal 
to the tabled critical one-tailed values $M_{.05} = .600$ and $M_{.01} = .800$. Additionally, since in 
Row 3 of Table 13.1 $[S_1(X) = .80] > [S_2(X) = 0]$, the data are consistent with the alternative 
hypothesis $H_1$: $F_1(X) > F_2(X)$. In other words, for the computed value of M, the cumulative 
proportion for Sample 1 is larger than the cumulative proportion for Sample 2. 

c) If the directional alternative hypothesis $H_1$: $F_1(X) < F_2(X)$ is employed, the null 
hypothesis cannot be rejected, since in order for the latter alternative hypothesis to be supported, 
for the computed value of M, the cumulative proportion for Sample 2 must be larger than the 
cumulative proportion for Sample 1 (which is not the case in Row 3 of Table 13.1). 
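
The three guidelines can be expressed as a small decision function. The sketch below is illustrative only: the function name `evaluate_ks` and the labels "two-sided", "greater", and "less" are assumptions introduced here, and the critical values are the Table A23 values quoted above for $n_1 = n_2 = 5$.

```python
def evaluate_ks(m, signed_diff, alternative, critical):
    """Apply the decision guidelines; `critical` maps alpha to the tabled critical M.

    alternative: "two-sided", "greater" (F1 > F2), or "less" (F1 < F2).
    These labels are illustrative, not the Handbook's terminology.
    signed_diff: S1(X) - S2(X) at the point that defines M.
    """
    # A directional hypothesis also requires the difference to be in the predicted direction
    if alternative == "greater" and signed_diff <= 0:
        return {alpha: False for alpha in critical}
    if alternative == "less" and signed_diff >= 0:
        return {alpha: False for alpha in critical}
    # Reject when M is equal to or greater than the tabled critical value
    return {alpha: m >= value for alpha, value in critical.items()}

two_tailed = {0.05: 0.800, 0.01: 0.800}  # values quoted for n1 = n2 = 5
one_tailed = {0.05: 0.600, 0.01: 0.800}

m, signed_diff = 0.80, 0.80  # Row 3 of Table 13.1: .80 - 0 = .80
print(evaluate_ks(m, signed_diff, "two-sided", two_tailed))
print(evaluate_ks(m, signed_diff, "greater", one_tailed))
print(evaluate_ks(m, signed_diff, "less", one_tailed))
```

Each call returns, for every significance level supplied, whether the null hypothesis can be rejected under the chosen alternative hypothesis.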

A summary of the analysis of Example 13.1 with the Kolmogorov-Smirnov test for two 
independent samples follows: It can be concluded that there is a high likelihood the two groups 
are derived from different populations. More specifically, the data indicate that the depression 
ratings for Group 1 (i.e., the group that receives the antidepressant medication) are significantly 
less than the depression ratings for Group 2 (the placebo group). 

When the same set of data is evaluated with the t test for two independent samples and 
the Mann-Whitney U test (i.e., Examples 11.1/12.1), in the case of both of the latter tests, the 
null hypothesis can only be rejected (and only at the .05 level) if the researcher employs a 
directional alternative hypothesis that predicts a lower level of depression for Group 1. The latter 
result is consistent with the result obtained with the Kolmogorov-Smirnov test, in that the 
directional alternative hypothesis $H_1$: $F_1(X) > F_2(X)$ is supported. Note, however, that the 
latter directional alternative hypothesis is supported at both the .05 and .01 levels when the 
Kolmogorov-Smirnov test is employed. In addition, the nondirectional alternative hypothesis 
is supported at both the .05 and .01 levels with the Kolmogorov-Smirnov test, but is not 
supported when the t test and Mann-Whitney U test are used. Although the results obtained 
with the Kolmogorov-Smirnov test for two independent samples are not identical with the 
results obtained with the t test for two independent samples and the Mann-Whitney U test, 
they are reasonably consistent. 

It should be noted that in most instances when the Kolmogorov-Smirnov test for two
independent samples and the t test for two independent samples are employed to evaluate the same
set of data, the Kolmogorov-Smirnov test will provide a less powerful test of an alternative
hypothesis. Thus, although it did not turn out to be the case for Examples 11.1/13.1, if a
significant difference is present, the t test will be the more likely of the two tests to detect it. Siegel
and Castellan (1988) note that when compared with the t test for two independent samples, the
Kolmogorov-Smirnov test has a power efficiency (which is defined in Section VII of the
Wilcoxon signed-ranks test (Test 6)) of .95 for small sample sizes, and a slightly lower power
efficiency for larger sample sizes.


VI. Additional Analytical Procedures for the Kolmogorov-Smirnov 
Test for Two Independent Samples 


1. Graphical method for computing the Kolmogorov-Smirnov test statistic Conover (1980, 
1999) employs a graphical method for computing the Kolmogorov-Smirnov test statistic that 
is based on the same logic as the graphical method that is briefly discussed for computing the test 
statistic for the Kolmogorov-Smirnov goodness-of-fit test for a single sample. The method 
involves constructing a graph of the cumulative probability distribution for each sample and 
measuring the point of maximum distance between the two cumulative probability distributions. 
Daniel (1990) describes a graphical method that employs a graph referred to as a pair chart as 
an alternative way of computing the Kolmogorov-Smirnov test statistic. The latter method is 
attributed to Hodges (1958) and Quade (1973) (who cites Drion (1952) as having developed the 
pair chart). 


2. Computing sample confidence intervals for the Kolmogorov-Smirnov test for two
independent samples The same procedure that is described for computing a confidence interval
for cumulative probabilities for the sample distribution that is evaluated with the
Kolmogorov-Smirnov goodness-of-fit test for a single sample can be employed to compute a confidence
interval for cumulative probabilities for either one of the samples that are evaluated with the
Kolmogorov-Smirnov test for two independent samples. Specifically, Equation 7.1 is
employed to compute the upper and lower limits for each of the points in a confidence interval.
Thus, for each sample, M_(1-α) is added to and subtracted from each of the S_j(X) values. Note that
the value of M_(1-α) employed in constructing a confidence interval for each of the samples is
derived from Table A21 (Table of Critical Values for the Kolmogorov-Smirnov
Goodness-of-Fit Test for a Single Sample) in the Appendix. Thus, if one is computing a 95%
confidence interval for each of the samples, the tabled critical two-tailed value M_.05 = .563 for
n = n_1 = n_2 = 5 is employed to represent M_(1-α) in Equation 7.1.

Note that the notation S_j(X) is used to represent the points on a cumulative probability
distribution for the Kolmogorov-Smirnov test for two independent samples, while the notation
S(X_i) is used to represent the points on the cumulative probability distribution for the sample
evaluated with the Kolmogorov-Smirnov goodness-of-fit test for a single sample. In the case
of the latter test, there is only one sample for which a confidence interval can be computed, while
in the case of the Kolmogorov-Smirnov test for two independent samples, a confidence
interval can be constructed for each of the independent samples.


3. Large sample chi-square approximation for a one-tailed analysis for the
Kolmogorov-Smirnov test for two independent samples Siegel and Castellan (1988) note that Goodman
(1954) has shown that Equation 13.1 (which employs the chi-square distribution) can provide
a good approximation for large sample sizes when a one-tailed/directional alternative
hypothesis is evaluated.³


    \chi^2 = 4M^2 \left( \frac{n_1 n_2}{n_1 + n_2} \right)        (Equation 13.1)
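
As a rough illustration of how Equation 13.1 is applied, the following Python sketch evaluates it for M = .80 and n_1 = n_2 = 5 (the values of Example 13.1). The function name is hypothetical. The probability it reports relies on the fact that the upper-tail area of a chi-square distribution with two degrees of freedom is exp(-χ^2/2). It is offered as a convenience and is not part of the tabled procedure.

```python
import math

def ks_chi_square_approx(m, n1, n2):
    """Chi-square approximation of Equation 13.1 (one-tailed analysis)."""
    return 4 * m ** 2 * (n1 * n2) / (n1 + n2)

# Values for Example 13.1: M = .80, n1 = n2 = 5.
chi_sq = ks_chi_square_approx(0.80, 5, 5)

# For df = 2 the upper-tail probability has the closed form exp(-x / 2).
p_upper_tail = math.exp(-chi_sq / 2)

print(round(chi_sq, 2), round(p_upper_tail, 4))
```

The computed statistic can then be compared with tabled chi-square critical values for two degrees of freedom.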


The computed value of chi-square is evaluated with Table A4 (Table of the Chi-Square
Distribution) in the Appendix. The degrees of freedom employed in the analysis will always
be df = 2. The tabled critical one-tailed .05 and .01 chi-square values in Table A4 for df = 2
are χ^2_.05 = 5.99 and χ^2_.01 = 9.21. If the computed value of chi-square is equal to or greater than
either of the aforementioned values, the null hypothesis can be rejected at the appropriate level
of significance (i.e., the directional alternative hypothesis that is consistent with the data will be
supported). Although our sample size is too small for the large sample approximation, for
purposes of illustration we will use it. When the appropriate values for Example 13.1 are
substituted in Equation 13.1, the value χ^2 = 4(.80)^2[(5)(5)/(5 + 5)] = 6.4 is computed. Since
χ^2 = 6.4 is larger than χ^2_.05 = 5.99 but less than χ^2_.01 = 9.21, the null hypothesis can be
rejected, but only at the .05 level. Thus, the directional alternative hypothesis
H_1: F_1(X) > F_2(X) is supported at the .05 level. Note that when the tabled critical values in
Table A23 are employed, the latter alternative hypothesis is also supported at the .01 level. The
latter is consistent with the fact that Siegel and Castellan (1988) note that when Equation 13.1
is employed with small sample sizes, it tends to yield a conservative result (i.e., it is less likely
to reject a false null hypothesis).


VII. Additional Discussion of the Kolmogorov-Smirnov Test
for Two Independent Samples 


1. Additional comments on the Kolmogorov-Smirnov test for two independent samples
a) Daniel (1990) states that if for both populations a continuous dependent variable is evaluated,
the Kolmogorov-Smirnov test for two independent samples yields exact probabilities. He 
notes, however, that Noether (1963, 1967) has demonstrated that if a discrete dependent variable 
is evaluated, the test tends to be conservative (i.e., is less likely to reject a false null hypothesis); 
b) Sprent (1993) notes that the Kolmogorov-Smirnov test for two independent samples may 
not be as powerful as tests that focus on whether or not there is a difference on a specific dis- 
tributional characteristic such as a measure of central tendency and/or variability. Siegel and 
Castellan (1988) state that the Kolmogorov-Smirnov test for two independent samples is more 
powerful than the chi-square test for r x c tables (Test 16) and the median test for independent 
samples (Test 16e). They also note that for small sample sizes, the Kolmogorov-Smirnov test 
has a higher power efficiency than the Mann-Whitney U test, but as the sample size increases 
the opposite becomes true with regard to the power efficiency of the two tests; and c) Conover 
(1980, 1999) and Hollander and Wolfe (1999) provide a more detailed discussion of the theory 
underlying the Kolmogorov-Smirnov test for two independent samples. 


VIII. Additional Examples Illustrating the Kolmogorov-Smirnov 
Test for Two Independent Samples 


Since Examples 11.4 and 11.5 in Section VIII of the t test for two independent samples employ
the same data as Example 13.1, the Kolmogorov-Smirnov test for two independent samples
will yield the same result if employed to evaluate the latter two examples. In addition, the 
Kolmogorov-Smirnov test can be employed to evaluate Examples 11.2 and 11.3. Since 
different data are employed in the latter examples, the result obtained with the Kolmogorov- 
Smirnov test will not be the same as that obtained for Example 13.1. Example 11.2 is evaluated 
below with the Kolmogorov-Smirnov test for two independent samples. Table 13.2 sum- 
marizes the analysis. 


Table 13.2 Calculation of Test Statistic for Kolmogorov-Smirnov Test 
for Two Independent Samples for Example 11.2 


   A          B              C          D                    E
  X_1      S_1(X)           X_2      S_2(X)           S_1(X) - S_2(X)

   -       0                7        1/5 = .20          0 - .20 = -.20
   8       1/5 = .20        8, 8     3/5 = .60        .20 - .60 = -.40
   9       2/5 = .40        9, 9     5/5 = 1.00      .40 - 1.00 = -.60
 10, 10    4/5 = .80        -        5/5 = 1.00      .80 - 1.00 = -.20
  11       5/5 = 1.00       -        5/5 = 1.00     1.00 - 1.00 =  .00
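
The calculations in Table 13.2 can be reproduced with a short Python sketch. It is a minimal illustration, not a replacement for the tabled procedure: the function name is hypothetical, the scores are those read from Columns A and C of the table, and exact fractions are used so that the cumulative proportions match the table without rounding error.

```python
from fractions import Fraction

def ks_two_sample(sample1, sample2):
    """Return (M, signed difference at M), where M = max |S1(X) - S2(X)|."""
    n1, n2 = len(sample1), len(sample2)
    largest = Fraction(0)
    # Evaluate both cumulative proportions at every distinct score.
    for x in sorted(set(sample1) | set(sample2)):
        s1 = Fraction(sum(v <= x for v in sample1), n1)
        s2 = Fraction(sum(v <= x for v in sample2), n2)
        diff = s1 - s2                      # Column E of the table
        if abs(diff) > abs(largest):
            largest = diff
    return abs(largest), largest

group1 = [8, 9, 10, 10, 11]   # Column A of Table 13.2
group2 = [7, 8, 8, 9, 9]      # Column C of Table 13.2

m, signed_diff = ks_two_sample(group1, group2)
print(float(m), float(signed_diff))
```

The sign of the second value shows which sample has the larger cumulative proportion at the point of maximum separation, which is what determines the directional alternative hypothesis the data are consistent with.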


The obtained value of the test statistic is M = .60, since .60 is the largest absolute value for a
difference score recorded in Column E of Table 13.2. Since n_1 = 5 and n_2 = 5, we employ
the same critical values used in evaluating Example 13.1. If the nondirectional alternative
hypothesis H_1: F_1(X) ≠ F_2(X) is employed, the null hypothesis cannot be rejected at the .05 level,
since M = .60 is less than the tabled critical two-tailed value M_.05 = .800. The data are consistent
with the directional alternative hypothesis H_1: F_1(X) < F_2(X) since in Row 3 of Table 13.2
[S_1(X) = .40] < [S_2(X) = 1]. In other words, for the computed value of M, the cumulative
proportion for Sample 2 is larger than the cumulative proportion for Sample 1. The directional
alternative hypothesis H_1: F_1(X) < F_2(X) is supported at the .05 level, since M = .60 is equal
to the tabled critical one-tailed value M_.05 = .600. It is not, however, supported at the .01 level,
since M = .60 is less than the tabled critical one-tailed value M_.01 = .800. The directional
alternative hypothesis H_1: F_1(X) > F_2(X) is not supported, since it is not consistent with the data.

When the null hypothesis H_0: μ_1 = μ_2 is evaluated with the t test for two independent
samples, the only alternative hypothesis which is supported (but only at the .05 level) is the
directional alternative hypothesis H_1: μ_1 > μ_2. The latter result (indicating higher scores in
Group 1) is consistent with the result that is obtained when the Kolmogorov-Smirnov test for
two independent samples is employed to evaluate the same set of data.


Endnotes


1. Marascuilo and McSweeney (1977) employ a modified protocol that can result in a larger
absolute value for M in Column E than the one obtained in Table 13.1. The latter protocol
employs a separate row for the score of each subject when the same score occurs more than
once within a group. If the latter protocol is employed in Table 13.1, the first two rows of
the table will have the score of 0 in Column A for the two subjects in Group 1 who obtain that
score. The first 0 will be in the first row, and have a cumulative proportion in Column B of
1/5 = .20. The second 0 will be in the second row, and have a cumulative proportion in
Column B of 2/5 = .40. In the same respect the first of the two scores of 11 (obtained by two
subjects in Group 2) will be in a separate row in Column C, and have a cumulative
proportion in Column D of 4/5 = .80. The second score of 11 will be in the last row of the
table, and have a cumulative proportion in Column D of 5/5 = 1. In the case of Example
13.1, the outcome of the analysis will not be affected if the aforementioned protocol is
employed. In some instances, however, it can result in a larger M value. The protocol
employed by Marascuilo and McSweeney (1977) is used by sources who argue that when
there are ties present in the data (i.e., the same score occurs more than once within a group),
the protocol described in this chapter (which is used in most sources) results in an overly
conservative test (i.e., makes it more difficult to reject a false null hypothesis).


2. When the values of n_1 and n_2 are small, some of the .05 and .01 critical values listed in
Table A23 are identical to one another. 


3. The last row in Table A23 can also be employed to compute a critical M value for large
sample sizes.


Test 14


The Siegel-Tukey Test for Equal Variability 
(Nonparametric Test Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Do two independent samples represent two populations with 
different variances? 


Relevant background information on test Developed by Siegel and Tukey (1960), the 
Siegel-Tukey test for equal variability is employed with ordinal (rank-order) data in a 
hypothesis testing situation involving two independent samples. If the result of the Siegel-Tukey
test for equal variability is significant, it indicates there is a significant difference between the 
sample variances, and as a result of the latter the researcher can conclude there is a high likelihood 
that the samples represent populations with different variances. 

The Siegel-Tukey test for equal variability is one of a number of tests of dispersion (also 
referred to as tests of scale or spread) that have been developed for contrasting the variances of 
two independent samples. A discussion of alternative nonparametric tests of dispersion can be 
found in Section VII. Some sources recommend the use of nonparametric tests of dispersion for 
evaluating the homogeneity of variance hypothesis when there is reason to believe that the nor- 
mality assumption of the appropriate parametric test for evaluating the same hypothesis is 
violated. Sources that are not favorably disposed toward nonparametric tests recommend the use 
of Hartley's F_max test for homogeneity of variance/F test for two population variances (Test
11a) (or one of the alternative parametric tests that are available for evaluating homogeneity of 
variance), regardless of whether or not the normality assumption of the parametric test is 
violated. Such sources do, however, recommend that in employing a parametric test, a researcher 
employ a lower significance level to compensate for the fact that violation of the normality 
assumption can inflate the Type I error rate associated with the test. When there is no evidence 
to indicate that the normality assumption of the parametric test has been violated, sources are in 
general agreement that such a test is preferable to the Siegel-Tukey test for equal variability 
(or an alternative nonparametric test of dispersion), since a parametric test (which uses more 
information than a nonparametric test) provides a more powerful test of an alternative hypothesis. 

Since nonparametric tests are not assumption free, the choice of which of the available tests 
of dispersion to employ will primarily depend on what assumptions a researcher is willing to 
make with regard to the underlying distributions represented by the sample data. The 
Siegel-Tukey test for equal variability is based on the following assumptions: a) Each sample 
has been randomly selected from the population it represents; b) The two samples are independent 
of one another; c) The level of measurement the data represent is at least ordinal; and d) The two 
populations from which the samples are derived have equal medians. If the latter assumption is 
violated, but the researcher does know the values of the population medians, the scores in the 
groups can be adjusted so as to allow the use of the Siegel-Tukey test for equal variability.
When, however, the population medians are unknown, and one is unwilling to assume they are
equal, the Siegel-Tukey test for equal variability is not the appropriate nonparametric test of
dispersion to employ. The assumption of equality of population medians presents a practical 
problem, in that when evaluating two independent samples a researcher will often have no prior 
knowledge regarding the population medians. In point of fact, most hypothesis testing addresses 
the issue of whether or not the medians (or means) of two or more populations are equal. In view 
of this, sources such as Siegel and Castellan (1988) note that if the latter values are not known, 
it is not appropriate to estimate them with the sample medians. 

In employing the Siegel-Tukey test for equal variability, one of the following is true with 
regard to the rank-order data that are evaluated: a) The data are in a rank-order format, since it 
is the only format in which scores are available; or b) The data have been transformed to a rank- 
order format from an interval/ratio format, since the researcher has reason to believe that the 
normality assumption of the analogous parametric test is saliently violated. 


II. Example 


Example 14.1 In order to assess the effect of two antidepressant drugs, 12 clinically depressed
patients are randomly assigned to one of two groups. Six patients are assigned to Group 1,
which is administered the antidepressant drug Elatrix for a period of six months. The other six
patients are assigned to Group 2, which is administered the antidepressant drug Euphyria during
the same six-month period. Assume that prior to introducing the experimental treatments, the 
experimenter confirmed that the level of depression in the two groups was equal. After six 
months elapse, all 12 subjects are rated by a psychiatrist (who is blind with respect to a subject's 
experimental condition) on their level of depression. The psychiatrist's depression ratings for 
the six subjects in each group follow (the higher the rating, the more depressed a subject): 
Group 1: 10, 10, 9, 1, 0, 0; Group 2: 6,6,5,5,4,4. 

The fact that the mean and median of each group are equivalent (specifically, both values 
equal 5) is consistent with prior research which suggests that there is no difference in efficacy 
for the two drugs (when the latter is based on a comparison of group means and/or medians). 
Inspection of the data does suggest, however, that there is much greater variability in the 
depression scores of subjects in Group 1. To be more specific, the data suggest that the drug 
Elatrix may, in fact, decrease depression in some subjects, yet increase it in others. The re- 
searcher decides to contrast the variability within the two groups through use of the Siegel- 
Tukey test for equal variability. The use of the latter nonparametric test is predicated on the 
fact that there is reason to believe that the distributions of the posttreatment depression scores 
in the underlying populations are not normal (which is why the researcher is reluctant to 
evaluate the data with Hartley's F_max test for homogeneity of variance/F test for two population
variances). Do the data indicate there is a significant difference between the variances of the 
two groups? 


III. Null versus Alternative Hypotheses 


Null hypothesis                 H_0: σ_1^2 = σ_2^2


(The variance of the population Group 1 represents equals the variance of the population Group
2 represents. With respect to the sample data, when both groups have an equal sample size, this
translates into the sum of the ranks of Group 1 being equal to the sum of the ranks of Group 2
(i.e., ΣR_1 = ΣR_2). A more general way of stating this, which also encompasses designs
involving unequal sample sizes, is that the means of the ranks of the two groups are equal (i.e.,
R̄_1 = R̄_2).)


Alternative hypothesis          H_1: σ_1^2 ≠ σ_2^2


(The variance of the population Group 1 represents does not equal the variance of the population
Group 2 represents. With respect to the sample data, when both groups have an equal sample
size, this translates into the sum of the ranks of Group 1 not being equal to the sum of the ranks
of Group 2 (i.e., ΣR_1 ≠ ΣR_2). A more general way of stating this, which also encompasses
designs involving unequal sample sizes, is that the means of the ranks of the two groups are not
equal (i.e., R̄_1 ≠ R̄_2). This is a nondirectional alternative hypothesis and it is evaluated with
a two-tailed test.)


or

                                H_1: σ_1^2 > σ_2^2


(The variance of the population Group 1 represents is greater than the variance of the population
Group 2 represents. With respect to the sample data, when both groups have an equal sample
size (so long as a rank of 1 is given to the lowest score), this translates into the sum of the ranks
of Group 1 being less than the sum of the ranks of Group 2 (i.e., ΣR_1 < ΣR_2). A more general
way of stating this, which also encompasses designs involving unequal sample sizes, is that the
mean of the ranks of Group 1 is less than the mean of the ranks of Group 2 (i.e., R̄_1 < R̄_2). This
is a directional alternative hypothesis and it is evaluated with a one-tailed test.)


or

                                H_1: σ_1^2 < σ_2^2


(The variance of the population Group 1 represents is less than the variance of the population
Group 2 represents. With respect to the sample data, when both groups have an equal sample size
(so long as a rank of 1 is given to the lowest score), this translates into the sum of the ranks of
Group 1 being greater than the sum of the ranks of Group 2 (i.e., ΣR_1 > ΣR_2). A more general
way of stating this, which also encompasses designs involving unequal sample sizes, is that the
mean of the ranks of Group 1 is greater than the mean of the ranks of Group 2 (i.e., R̄_1 > R̄_2).
This is a directional alternative hypothesis and it is evaluated with a one-tailed test.)


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 


The total number of subjects employed in the experiment is N = 12. There are n_1 = 6 subjects
in Group 1 and n_2 = 6 subjects in Group 2. The data for the analysis are summarized in Table
14.1. The original interval/ratio scores of the subjects are recorded in the columns labelled X_1
and X_2. The adjacent columns R_1 and R_2 contain the rank-order assigned to each of the scores.
The rankings for Example 14.1 are summarized in Table 14.2. The ranking protocol for the
Siegel-Tukey test for equal variability is described in this section. Note that in Table 14.1 and
Table 14.2 each subject's identification number indicates the order in Table 14.1 in which a
subject's score appears in a given group, followed by his/her group. Thus, Subject i,j is the
i-th subject in Group j.

The computational procedure for the Siegel-Tukey test for equal variability is identical
to that employed for the Mann-Whitney U test (Test 12), except for the fact that the two
tests employ a different ranking protocol. Recollect that in the description of the alternative
hypotheses for the Siegel-Tukey test for equal variability, it is noted that when a directional
alternative hypothesis is supported, the average of the ranks of the group with the larger variance
will be less than the average of the ranks of the group with the smaller variance. On the other
hand, when a directional hypothesis for the Mann-Whitney U test is supported, the average of
the ranks of the group with the larger median will be greater than the average of the ranks of the
group with the smaller median. The difference between the two tests with respect to the ordinal
position of the average ranks reflects the fact that the tests employ different ranking protocols.
Whereas the ranking protocol for the Mann-Whitney U test is designed to identify differences
with respect to central tendency (specifically, the median values), the ranking protocol for the
Siegel-Tukey test for equal variability is designed to identify differences with respect to
variability. The ranking protocol for the Siegel-Tukey test for equal variability is based on
the premise that within the overall distribution of N scores, the distribution of scores in the group
with the higher variance will contain more extreme values (i.e., scores that are very high and
scores that are very low) than the distribution of scores in the group with the lower variance.


Table 14.1 Data for Example 14.1

            Group 1                              Group 2
                 X_1     R_1                          X_2     R_2
Subject 1,1       10     2.5         Subject 1,2        6     8.5
Subject 2,1       10     2.5         Subject 2,2        6     8.5
Subject 3,1        9     6           Subject 3,2        5    11.5
Subject 4,1        1     5           Subject 4,2        5    11.5
Subject 5,1        0     2.5         Subject 5,2        4     8.5
Subject 6,1        0     2.5         Subject 6,2        4     8.5
                 ΣR_1 = 21                            ΣR_2 = 57

R̄_1 = ΣR_1/n_1 = 21/6 = 3.5          R̄_2 = ΣR_2/n_2 = 57/6 = 9.5


Table 14.2 Rankings for the Siegel-Tukey Test for Equal Variability for Example 14.1

Subject identification number   5,1   6,1   4,1   5,2   6,2   3,2   4,2   1,2   2,2   3,1   1,1   2,1
Depression score                  0     0     1     4     4     5     5     6     6     9    10    10
Rank prior to tie adjustment      1     4     5     8     9    12    11    10     7     6     3     2
Tie-adjusted rank               2.5   2.5     5   8.5   8.5  11.5  11.5   8.5   8.5     6   2.5   2.5

The following protocol, which is summarized in Table 14.2, is used in assigning ranks. 

a) All N= 12 scores are arranged in order of magnitude (irrespective of group membership), 
beginning on the left with the lowest score and moving to the right as scores increase. This is
done in the second row of Table 14.2. 

b) Ranks are now assigned in the following manner: A rank of 1 is assigned to the lowest 
score (0). A rank of 2 is assigned to the highest score (10), and a rank of 3 is assigned to the 
second highest score (10). A rank of 4 is assigned to the second lowest score (0), and a rank of 
5 is assigned to the third lowest score (1). A rank of 6 is assigned to the third highest score (9), 
and a rank of 7 is assigned to the fourth highest score (6). A rank of 8 is assigned to the fourth 
lowest score (4), and a rank of 9 is assigned to the fifth lowest score (4). A rank of 10 is assigned 
to the fifth highest score (6), and a rank of 11 is assigned to the sixth highest score (5). A rank 
of 12 is assigned to the sixth lowest score (5). Note that the ranking protocol assigns ranks to 
the distribution of N = 12 scores by alternating from one extreme of the distribution to the other. 
The ranks assigned employing this protocol are listed in the third row of Table 14.2.

c) The ranks in the third row of Table 14.2 must be adjusted when there are tied scores
present in the data. The same procedure for handling ties that is described for the
Mann-Whitney U test is also employed for the Siegel-Tukey test for equal variability. Specifically,
in instances where two or more subjects have the same score, the average of the ranks involved
is assigned to all scores tied for a given rank. This adjustment is made in the fourth row of
Table 14.2. To illustrate: Both Subjects 5,1 and 6,1 have a score of 0. Since the ranks assigned
to the scores of these two subjects are, respectively, 1 and 4, the average of the two ranks
(1 + 4)/2 = 2.5 is assigned to the score of both subjects. Both Subjects 1,1 and 2,1 have a
score of 10. Since the ranks assigned to the scores of these two subjects are, respectively, 2 and
3, the average of the two ranks (2 + 3)/2 = 2.5 is assigned to the score of both subjects. For
the remaining three sets of ties (which all happen to fall in Group 2) the same averaging
procedure is employed.
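
Steps a) through c) can be sketched in Python as follows. This is an illustrative implementation of the ranking protocol as described above, not code taken from any statistical package, and the function name is hypothetical. It assigns rank 1 to the lowest score, then alternates between the two ends of the ordered distribution two scores at a time, and finally averages the ranks of tied scores.

```python
def siegel_tukey_ranks(scores):
    """Return tie-adjusted Siegel-Tukey ranks, in the same order as `scores`."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # indices, lowest score first
    raw = [0] * n            # raw[k] is the rank given to the k-th ordered score
    lo, hi = 0, n - 1
    rank, take, from_low = 1, 1, True
    while lo <= hi:
        for _ in range(take):
            if lo > hi:
                break
            if from_low:
                raw[lo] = rank
                lo += 1
            else:
                raw[hi] = rank
                hi -= 1
            rank += 1
        from_low = not from_low
        take = 2             # after the first score, ranks are assigned two at a time
    # Tie adjustment: each tied score receives the mean of the ranks involved.
    ranks_by_value = {}
    for pos, idx in enumerate(order):
        ranks_by_value.setdefault(scores[idx], []).append(raw[pos])
    mean_rank = {v: sum(r) / len(r) for v, r in ranks_by_value.items()}
    return [mean_rank[s] for s in scores]

group1 = [10, 10, 9, 1, 0, 0]   # Example 14.1, Group 1
group2 = [6, 6, 5, 5, 4, 4]     # Example 14.1, Group 2

ranks = siegel_tukey_ranks(group1 + group2)
sum_r1 = sum(ranks[:len(group1)])
sum_r2 = sum(ranks[len(group1):])
print(sum_r1, sum_r2)
```

For the data of Example 14.1 the two sums agree with the rank totals recorded in Table 14.1.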

It should be noted that in Example 14.1 each set of tied scores involves subjects who are 
in the same group. Any time each set of ties involves subjects in the same group, the tie 
adjustment will result in the identical sum and average for the ranks of the two groups that will 
be obtained if the tie adjustment is not employed. Because of this, under these conditions the 
computed test statistic will be identical regardless of whether or not one uses the tie adjustment. 
On the other hand, when one or more sets of ties involve subjects from both groups, the tie- 
adjusted ranks will yield a value for the test statistic that is different from that which will be 
obtained if the tie adjustment is not employed. 

It should be noted that it is permissible to reverse the ranking protocol described in this 
section. Specifically, one can assign a rank of 1 to the highest score, a rank of 2 to the lowest 
score, a rank of 3 to the second lowest score, a rank of 4 to the second highest score, a rank of 
5 to the third highest score, and so on. This reverse-ranking protocol will result in the same test 
statistic and, consequently, the same conclusion with respect to the null hypothesis as the ranking 
protocol described in this section.¹

Once all of the subjects have been assigned a rank, the sum of the ranks for each of the 
groups is computed. These values, ΣR_1 = 21 and ΣR_2 = 57, are computed in Table 14.1. Upon
determining the sum of the ranks for both groups, the values U, and U, are computed using 
Equations 12.1 and 12.2, which are employed for the Mann-Whitney U test. The basis for 
employing the same equations and the identical distribution as that used for the Mann-Whitney 
U test is predicated on the fact that both tests employ the same sampling distribution. 


    U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - \sum R_1 = (6)(6) + \frac{6(6 + 1)}{2} - 21 = 36

    U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - \sum R_2 = (6)(6) + \frac{6(6 + 1)}{2} - 57 = 0


Note that U_1 and U_2 can never be negative values. If a negative value is obtained for
either, it indicates an error has been made in the rankings and/or calculations.
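
The computation of the two U values, together with the sanity checks just described, can be sketched in Python. The function name is hypothetical. The check that the two values sum to n_1 n_2 is the arithmetic identity that follows from the two equations when the ranks 1 through N have all been assigned.

```python
def u_values(sum_r1, sum_r2, n1, n2):
    """Compute U_1 and U_2 from the rank sums (Equations 12.1 and 12.2)."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - sum_r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - sum_r2
    # A negative U, or a pair that does not sum to n1 * n2, signals a
    # ranking or arithmetic error.
    if u1 < 0 or u2 < 0:
        raise ValueError("negative U value: check the rankings")
    if u1 + u2 != n1 * n2:
        raise ValueError("U_1 + U_2 must equal n_1 * n_2")
    return u1, u2

# Rank sums for Example 14.1 (from Table 14.1).
u1, u2 = u_values(21, 57, 6, 6)
print(u1, u2)
```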

As is the case for the Mann-Whitney U test, Equation 12.3 can be employed to verify the 
calculations. If the relationship in Equation 12.3 is not confirmed, it indicates that an error has 
been made in ranking the scores or in the computation of the U values. The relationship 
described by Equation 12.3 is confirmed below for Example 14.1. 


    n_1 n_2 = U_1 + U_2

    (6)(6) = 36 + 0 = 36


V. Interpretation of the Test Results


The smaller of the two values U_1 versus U_2 is designated as the obtained U statistic. Since
U_2 = 0 is smaller than U_1 = 36, the value of U = 0. The value of U is evaluated with Table
A11 (Table of Critical Values for the Mann-Whitney U Statistic) in the Appendix. In order
to be significant, the obtained value of U must be equal to or less than the tabled critical value
at the prespecified level of significance. For n_1 = 6 and n_2 = 6, the tabled critical two-tailed
values are U_.05 = 5 and U_.01 = 2, and the tabled critical one-tailed values are U_.05 = 7 and
U_.01 = 3.

Since the obtained value U = 0 is less than the tabled critical two-tailed values U_.05 = 5
and U_.01 = 2, the nondirectional alternative hypothesis H_1: σ_1^2 ≠ σ_2^2 is supported at both the .05
and .01 levels. Since the obtained value of U is less than the tabled critical one-tailed values
U_.05 = 7 and U_.01 = 3, the directional alternative hypothesis H_1: σ_1^2 > σ_2^2 is also supported at
both the .05 and .01 levels. The latter directional alternative hypothesis is supported since
R̄_1 < R̄_2, which indicates that the variability of scores in Group 1 is greater than the
variability of scores in Group 2. The directional alternative hypothesis H_1: σ_1^2 < σ_2^2 is not
supported, since in order for the latter alternative hypothesis to be supported, R̄_1 must be greater
than R̄_2 (which indicates that the variability of scores in Group 2 is greater than the variability
of scores in Group 1).
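
The decision rule used above can be summarized in a short Python sketch. The critical values are those quoted in this section for n_1 = n_2 = 6. The function and variable names are hypothetical, and the sketch assumes that, for a one-tailed analysis, the researcher has already confirmed that the data are in the predicted direction.

```python
def is_significant(u_obtained, u_critical):
    """U is significant when it is equal to or less than the tabled critical value."""
    return u_obtained <= u_critical

u = min(36, 0)   # the smaller of U_1 = 36 and U_2 = 0

# Tabled critical values for n_1 = n_2 = 6, as quoted in the text.
critical = {
    "two-tailed": {".05": 5, ".01": 2},
    "one-tailed": {".05": 7, ".01": 3},
}

decisions = {
    (tails, alpha): is_significant(u, value)
    for tails, levels in critical.items()
    for alpha, value in levels.items()
}

for (tails, alpha), reject in decisions.items():
    print(f"{tails}, alpha = {alpha}: reject H_0 = {reject}")
```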

Based on the results of the Siegel-Tukey test for equal variability, the researcher can
conclude that there is greater variability in the depression scores of the group that receives the 
drug Elatrix (Group 1) than the group that receives the drug Euphyria (Group 2). 


VI. Additional Analytical Procedures for the Siegel-Tukey Test
for Equal Variability and/or Related Tests 


1. The normal approximation of the Siegel-Tukey test statistic for large sample sizes As
is the case with the Mann-Whitney U test, the normal distribution can be employed with
large sample sizes to approximate the Siegel-Tukey test statistic. Equation 12.4, which is
employed for the large sample approximation of the Mann-Whitney distribution, can also be
employed for the large sample approximation of the Siegel-Tukey test statistic. As is noted in Section
VI of the Mann-Whitney U test, the large sample approximation is generally used for sample
sizes larger than those documented in the exact table contained within the source one is
employing.

In the discussion of the Mann-Whitney U test, it is noted that the term $(n_1 n_2)/2$ in the
numerator of Equation 12.4 represents the expected (mean) value of U if the null hypothesis is
true. This is also the case when the normal distribution is employed to approximate the
Siegel-Tukey test statistic. Thus, if the two population variances are in fact equal, it is expected that
$\bar{R}_1 = \bar{R}_2$ and, consequently, $U_1 = U_2 = (n_1 n_2)/2$.

Although Example 14.1 involves only N = 12 scores (a value most sources would view as 
too small to use with the normal approximation), it will be employed to illustrate Equation 12.4. 
The reader will see that in spite of employing Equation 12.4 with a small sample size, it yields 
a result that is consistent with the result obtained when the exact table for the Mann-Whitney 
U distribution is employed. As is noted in Section VI of the Mann-Whitney U test, since the
smaller of the two values $U_1$ versus $U_2$ is selected to represent U, the value of z will always
be negative (unless $U_1 = U_2$, in which case z = 0).

Employing Equation 12.4, the value z = -2.88 is computed.

$$ z = \frac{U - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} = \frac{0 - \frac{(6)(6)}{2}}{\sqrt{\frac{(6)(6)(6 + 6 + 1)}{12}}} = -2.88 $$

The obtained value z = -2.88 is evaluated with Table A1 (Table of the Normal
Distribution) in the Appendix. To be significant, the obtained absolute value of z must be equal
to or greater than the tabled critical value at the prespecified level of significance. The tabled
critical two-tailed .05 and .01 values are $z_{.05} = 1.96$ and $z_{.01} = 2.58$, and the tabled critical
one-tailed .05 and .01 values are $z_{.05} = 1.65$ and $z_{.01} = 2.33$. The following guidelines are
employed in evaluating the null hypothesis.

a) If the nondirectional alternative hypothesis $H_1: \sigma_1^2 \neq \sigma_2^2$ is employed, the null hypothesis
can be rejected if the obtained absolute value of z is equal to or greater than the tabled critical 
two-tailed value at the prespecified level of significance. 

b) If a directional alternative hypothesis is employed, one of the two possible directional 
alternative hypotheses is supported if the obtained absolute value of z is equal to or greater than 
the tabled critical one-tailed value at the prespecified level of significance. The directional 
alternative hypothesis that is supported is the one that is consistent with the data. 

Employing the above guidelines with Example 14.1, the following conclusions are reached. 

Since the obtained absolute value z = 2.88 is greater than the tabled critical two-tailed
values $z_{.05} = 1.96$ and $z_{.01} = 2.58$, the nondirectional alternative hypothesis $H_1: \sigma_1^2 \neq \sigma_2^2$ is
supported at both the .05 and .01 levels. Since the obtained absolute value z = 2.88 is greater
than the tabled critical one-tailed values $z_{.05} = 1.65$ and $z_{.01} = 2.33$, the directional alternative
hypothesis $H_1: \sigma_1^2 > \sigma_2^2$ is supported at both the .05 and .01 levels. The latter directional
alternative hypothesis is supported since it is consistent with the data. The directional alternative
hypothesis $H_1: \sigma_1^2 < \sigma_2^2$ is not supported, since it is not consistent with the data. Note that the
conclusions reached with reference to each of the possible alternative hypotheses are consistent 
with those reached when the exact table of the U distribution is employed. 

As is the case when the normal approximation is used with the Mann-Whitney U test, either
$U_1$ or $U_2$ can be employed in Equation 12.4 to represent the value of U, since both values will
yield the same absolute value for z.
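
The normal approximation described in this subsection can be checked numerically. The sketch below assumes the form of Equation 12.4 implied by the text, in which U is compared with its expected value $(n_1 n_2)/2$ and divided by the square root of $n_1 n_2 (n_1 + n_2 + 1)/12$. The function name `mann_whitney_z` is hypothetical, and the critical values are the Table A1 values quoted above.

```python
import math


def mann_whitney_z(u, n1, n2):
    """Large-sample z for a U value, in the form of Equation 12.4."""
    expected_u = n1 * n2 / 2
    standard_error = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - expected_u) / standard_error


z_from_smaller_u = mann_whitney_z(0, 6, 6)    # U = U2 = 0 in Example 14.1
z_from_larger_u = mann_whitney_z(36, 6, 6)    # U1 = 36 gives the same magnitude

# Tabled critical values quoted in the text (Table A1)
critical = {
    "two-tailed": {".05": 1.96, ".01": 2.58},
    "one-tailed": {".05": 1.65, ".01": 2.33},
}
decisions = {
    tail: {level: abs(z_from_smaller_u) >= value for level, value in levels.items()}
    for tail, levels in critical.items()
}
print(round(z_from_smaller_u, 2), round(z_from_larger_u, 2), decisions)
```

Either U value yields an absolute z of 2.88, which exceeds all four tabled critical values.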


2. The correction for continuity for the normal approximation of the Siegel-Tukey test
for equal variability Although not described in most sources, the correction for continuity
employed for the normal approximation of the Mann-Whitney U test can also be applied to
the Siegel-Tukey test for equal variability. Employing Equation 12.5 (the Mann-Whitney
continuity-corrected equation) with the data for Example 14.1, the value z = -2.80 is computed.

$$ |z| = \frac{\left| U - \frac{n_1 n_2}{2} \right| - .5}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} = \frac{\left| 0 - \frac{(6)(6)}{2} \right| - .5}{\sqrt{\frac{(6)(6)(6 + 6 + 1)}{12}}} = 2.80 $$

Because U is the smaller of the two values $U_1$ and $U_2$, z carries a negative sign, hence z = -2.80.


The obtained absolute value z = 2.80 is greater than the tabled critical two-tailed .05 and
.01 values $z_{.05} = 1.96$ and $z_{.01} = 2.58$, and the tabled critical one-tailed .05 and .01 values
$z_{.05} = 1.65$ and $z_{.01} = 2.33$. Thus, as is the case when the correction for continuity is not
employed, both the nondirectional alternative hypothesis $H_1: \sigma_1^2 \neq \sigma_2^2$ and the directional
alternative hypothesis $H_1: \sigma_1^2 > \sigma_2^2$ are supported at both the .05 and .01 levels. Note that the
absolute value of the continuity-corrected z value will always be less than the absolute value
computed when the correction for continuity is not used.
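
The effect of the correction can be illustrated with a short sketch. It assumes that Equation 12.5 subtracts .5 from the absolute difference between U and its expected value before dividing by the same standard error used in Equation 12.4. The function names are hypothetical, and the functions return absolute values of z.

```python
import math


def standard_error(n1, n2):
    """Standard deviation of U under the null hypothesis."""
    return math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)


def absolute_z(u, n1, n2):
    """Absolute z without the correction for continuity."""
    return abs(u - n1 * n2 / 2) / standard_error(n1, n2)


def absolute_z_corrected(u, n1, n2):
    """Absolute z with .5 subtracted from the absolute numerator."""
    return (abs(u - n1 * n2 / 2) - 0.5) / standard_error(n1, n2)


# Example 14.1: U = 0 with n1 = n2 = 6
uncorrected = absolute_z(0, 6, 6)
corrected = absolute_z_corrected(0, 6, 6)
print(round(uncorrected, 2), round(corrected, 2))
```

Since the correction only shrinks the numerator, the corrected absolute value is smaller than the uncorrected one, as the text states.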


3. Tie correction for the normal approximation of the Siegel-Tukey test statistic It is noted
in the discussion of the normal approximation of the Mann-Whitney U test that some sources
recommend that Equation 12.4 be modified when an excessive number of ties are present in the
data. Since the identical sampling distribution is involved, the same tie correction (which results
in a slight increase in the absolute value of z) can be employed for the normal approximation of
the Siegel-Tukey test for equal variability. Employing Equation 12.6 (the Mann-Whitney
tie correction equation), the tie-corrected value z = -2.91 is computed for Example 14.1. Note
that in Example 14.1 there are s = 5 sets of ties, each set involving two ties. Thus, in Equation
12.6 the term $\sum_{i=1}^{s} (t_i^3 - t_i) = 5[(2)^3 - 2] = 30$.

$$ z = \frac{U - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12} - \frac{n_1 n_2 \left[ \sum_{i=1}^{s} (t_i^3 - t_i) \right]}{12 (n_1 + n_2)(n_1 + n_2 - 1)}}} = \frac{0 - \frac{(6)(6)}{2}}{\sqrt{\frac{(6)(6)(6 + 6 + 1)}{12} - \frac{(6)(6)(30)}{12 (6 + 6)(6 + 6 - 1)}}} = -2.91 $$

The difference between z = -2.91 and the uncorrected value z = -2.88 is trivial, and
consequently the decision the researcher makes with respect to the null hypothesis is not affected,
regardless of which alternative hypothesis is employed.
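
The tie term and the tie-corrected z can be reproduced with the sketch below. It assumes the tie-corrected variance has the form shown for Equation 12.6, namely the usual variance of U minus $n_1 n_2 \sum (t_i^3 - t_i) / [12 (n_1 + n_2)(n_1 + n_2 - 1)]$. The function name `tie_corrected_z` is hypothetical. Tie sets are counted on the pooled scores, which is equivalent to counting tied ranks.

```python
import math
from collections import Counter


def tie_corrected_z(u, group1, group2):
    """Return (z, tie_term) using the tie-corrected variance of U."""
    n1, n2 = len(group1), len(group2)
    n = n1 + n2
    tie_sizes = Counter(list(group1) + list(group2)).values()
    tie_term = sum(t ** 3 - t for t in tie_sizes)  # untied scores contribute 0
    variance = n1 * n2 * (n + 1) / 12 - (n1 * n2 * tie_term) / (12 * n * (n - 1))
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return z, tie_term


# Example 14.1: U = 0
group1 = [10, 10, 9, 1, 0, 0]
group2 = [6, 6, 5, 5, 4, 4]
z_tie, tie_term = tie_corrected_z(0, group1, group2)
print(tie_term, round(z_tie, 2))
```

With five pairs of tied scores the tie term equals 30, and the corrected z differs from the uncorrected value only in the second decimal place.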


4. Adjustment of scores for the Siegel-Tukey test for equal variability when $\theta_1 \neq \theta_2$ It is
noted in Section I that if the values of the population medians are known but are not equal, in 
order to employ the Siegel-Tukey test for equal variability it is necessary to adjust the scores. 
In such a case, prior to ranking the scores the difference between the two population medians is 
subtracted from each of the scores in the group that represents the population with the higher 
median (or added to each of the scores in the group that represents the population with the lower 
median). This adjustment procedure will be demonstrated with Example 14.2. 


Example 14.2 In order to evaluate whether or not two teaching methods result in different 
degrees of variability with respect to performance, a mathematics instructor employs two 
methods of instruction with different groups of students. Prior to initiating the study it is 
determined that the two groups are comprised of students of equal math ability. Group 1, which 
is comprised of five subjects, is taught through the use of lectures and a conventional textbook 
(Method A). Group 2, which is comprised of six subjects, is taught through the use of a computer 
software package (Method B). At the conclusion of the course the final exam scores of the two 
groups are compared. The final exam scores follow (the maximum possible score on the final
exam is 10 points and the minimum 0): Group 1: 7, 5, 4, 4, 3; Group 2: 13, 12, 7, 7, 4, 3. The
researcher elects to rank-order the scores of the subjects, since she does not believe the data are
normally distributed in the underlying populations. If the Siegel-Tukey test for equal
variability is employed to analyze the data, is there a significant difference in within-groups
variability? 


From the sample data we can determine that the median score of Group 1 is 4, and the 
median score of Group 2 is 7. Although the computations will not be shown here, in spite of the 
three-point difference between the medians of the groups, if the Mann-Whitney U test is 
employed to evaluate the data, the null hypothesis $H_0: \theta_1 = \theta_2$ (i.e., that the medians of the
underlying populations are equal) cannot be rejected at the .05 level, regardless of whether a 
nondirectional or directional alternative hypothesis is employed. The fact that the null hypothesis 
cannot be rejected is largely the result of the small sample size, which limits the power of the 
Mann-Whitney U test to detect a difference between underlying populations, if, in fact, one 
exists. 

Let us assume, however, that based on prior research there is reason to believe that the 
median of the population represented by Group 2 is, in fact, three points higher than the median 
of the population represented by Group 1. In order to employ the Siegel-Tukey test for equal 
variability to evaluate the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$, the groups must be equated with
respect to their median values. This can be accomplished by subtracting the difference between
the population medians from each score in the group with the higher median. Thus, in Table 14.3
three points have been subtracted from the score of each of the subjects in Group 2.³ The scores
in Table 14.3 are ranked in accordance with the Siegel-Tukey test protocol. The ranks are 
summarized in Table 14.4. 


Table 14.3 Data for Example 14.2 Employing Adjusted $X_2$ Scores


                Group 1                              Group 2
                $X_1$    $R_1$                       $X_2$    $R_2$
Subject 1,1       7       6        Subject 1,2        10       2
Subject 2,1       5       7        Subject 2,2         9       3
Subject 3,1       4       9.5      Subject 3,2         4       9.5
Subject 4,1       4       9.5      Subject 4,2         4       9.5
Subject 5,1       3       5        Subject 5,2         1       4
                                   Subject 6,2         0       1
                $\Sigma R_1 = 37$                    $\Sigma R_2 = 29$

$\bar{R}_1 = \Sigma R_1 / n_1 = 37/5 = 7.4$          $\bar{R}_2 = \Sigma R_2 / n_2 = 29/6 = 4.83$


Table 14.4 Rankings for the Siegel-Tukey Test for Equal Variability for Example 14.2


Subject identification number   6,2   5,2   5,1   3,1   4,1   3,2   4,2   2,1   1,1   2,2   1,2
Exam score                       0     1     3     4     4     4     4     5     7     9    10
Rank prior to tie adjustment     1     4     5     8     9    11    10     7     6     3     2
Tie-adjusted rank                1     4     5    9.5   9.5   9.5   9.5    7     6     3     2
Equations 12.1 and 12.2 are employed to compute the values $U_1 = 8$ and $U_2 = 22$.

$$ U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - \Sigma R_1 = (5)(6) + \frac{5(5 + 1)}{2} - 37 = 8 $$

$$ U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - \Sigma R_2 = (5)(6) + \frac{6(6 + 1)}{2} - 29 = 22 $$


Employing Equation 12.3, we confirm the relationship between the sample sizes and the
computed values of $U_1$ and $U_2$.

$$ n_1 n_2 = U_1 + U_2 $$

$$ (5)(6) = 8 + 22 = 30 $$


Since $U_1 = 8$ is smaller than $U_2 = 22$, the value of U = 8. Employing Table A11 for
$n_1 = 5$ and $n_2 = 6$, we determine that the tabled critical two-tailed .05 and .01 values are
$U_{.05} = 3$ and $U_{.01} = 1$, and the tabled critical one-tailed .05 and .01 values are $U_{.05} = 5$ and
$U_{.01} = 2$. Since the obtained value U = 8 is greater than all of the aforementioned tabled
critical values, the null hypothesis cannot be rejected at either the .05 or .01 level, regardless of
whether a nondirectional or directional alternative hypothesis is employed.

If the normal approximation for the Siegel-Tukey test for equal variability is employed
with Example 14.2, it is also the case that the null hypothesis cannot be rejected, regardless of
which alternative hypothesis is employed. The latter is the case, since the computed absolute
value z = 1.28 is less than the tabled critical .05 and .01 two-tailed values $z_{.05} = 1.96$ and
$z_{.01} = 2.58$, and the tabled critical .05 and .01 one-tailed values $z_{.05} = 1.65$ and $z_{.01} = 2.33$.

$$ z = \frac{8 - \frac{(5)(6)}{2}}{\sqrt{\frac{(5)(6)(5 + 6 + 1)}{12}}} = -1.28 $$


Thus, the data do not indicate that the two teaching methods represent populations with
different variances. Of course, as is the case when the Mann-Whitney U test is employed with
the same set of data, it is entirely possible that a difference does exist in the underlying
populations but is not detected because of the small sample size employed in the study.
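
The adjustment procedure for Example 14.2 can be sketched end to end as follows. This is an illustrative sketch under stated assumptions: the three-point difference between the population medians is taken as known from prior research (as the text assumes), the helper name `siegel_tukey_rank_sums` is hypothetical, and the ranking alternates between the low and high ends of the pooled, ordered scores with tied scores sharing their mean rank.

```python
import math


def siegel_tukey_rank_sums(group1, group2):
    """Return the tie-adjusted Siegel-Tukey rank sums of the two groups."""
    pooled = sorted([(x, 1) for x in group1] + [(x, 2) for x in group2])
    n = len(pooled)
    # Order in which positions of the sorted list receive ranks 1, 2, 3, ...
    positions = []
    low, high, take, from_low = 0, n - 1, 1, True
    while low <= high:
        for _ in range(take):
            if low > high:
                break
            if from_low:
                positions.append(low)
                low += 1
            else:
                positions.append(high)
                high -= 1
        from_low, take = not from_low, 2
    raw = {position: rank for rank, position in enumerate(positions, start=1)}
    # Collect the raw ranks of each distinct score so that ties can be averaged.
    ranks_by_score = {}
    for position, (score, _) in enumerate(pooled):
        ranks_by_score.setdefault(score, []).append(raw[position])
    sums = {1: 0.0, 2: 0.0}
    for score, group in pooled:
        sums[group] += sum(ranks_by_score[score]) / len(ranks_by_score[score])
    return sums[1], sums[2]


method_a = [7, 5, 4, 4, 3]          # Group 1
method_b = [13, 12, 7, 7, 4, 3]     # Group 2
median_difference = 3               # assumed known population median difference
adjusted_b = [x - median_difference for x in method_b]

sum_r1, sum_r2 = siegel_tukey_rank_sums(method_a, adjusted_b)
n1, n2 = len(method_a), len(adjusted_b)
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - sum_r1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - sum_r2
u = min(u1, u2)
z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(sum_r1, sum_r2, u1, u2, round(z, 2))
```

The sketch reproduces the rank sums 37 and 29, the values $U_1 = 8$ and $U_2 = 22$, and z = -1.28.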


VII. Additional Discussion of the Siegel-Tukey Test for Equal 
Variability 


1. Analysis of the homogeneity of variance hypothesis for the same set of data with both
a parametric and nonparametric test, and the power-efficiency of the Siegel-Tukey test for
equal variability As noted in Section I, the use of the Siegel-Tukey test for equal variability
would most likely be based on the fact that a researcher has reason to believe that the data in the
underlying populations are not normally distributed. If, however, in the case of Example 14.1
the normality assumption is not an issue, or if it is but in spite of it a researcher prefers to use a
parametric procedure such as Hartley's $F_{max}$ test for homogeneity of variance/F test for two
population variances, she can still reject the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ at both the .05 and .01
levels, regardless of whether a nondirectional or directional alternative hypothesis is employed.


© 2000 by Chapman & Hall/CRC 


This is demonstrated by employing Equation 11.6 (the equation for Hartley's $F_{max}$ test for
homogeneity of variance) with the data for Example 14.1.

$$ \Sigma X_1 = 30 \qquad \Sigma X_1^2 = 282 \qquad \Sigma X_2 = 30 \qquad \Sigma X_2^2 = 154 $$

$$ \tilde{s}_1^2 = \frac{282 - \frac{(30)^2}{6}}{6 - 1} = 26.4 \qquad \tilde{s}_2^2 = \frac{154 - \frac{(30)^2}{6}}{6 - 1} = .8 $$

$$ F_{max} = \frac{\tilde{s}_L^2}{\tilde{s}_S^2} = \frac{26.4}{.8} = 33 $$

Table A9 (Table of the $F_{max}$ Distribution) in the Appendix is employed to evaluate the
computed value $F_{max} = 33$. For k = 2 groups and (n - 1) = (6 - 1) = 5 (since $n_1 = n_2 = n = 6$),
the appropriate tabled critical values for a nondirectional analysis are $F_{max_{.05}} = 7.15$ and
$F_{max_{.01}} = 14.9$. Since the obtained value $F_{max} = 33$ is greater than both of the aforementioned
tabled critical values, the nondirectional alternative hypothesis $H_1: \sigma_1^2 \neq \sigma_2^2$ is supported at both
the .05 and .01 levels.⁴

In the case of a directional analysis, the appropriate tabled critical one-tailed .05 and .01
values must be obtained from Table A10 (Table of the F Distribution) in the Appendix. In
Table A10, the values for $F_{.95}$ and $F_{.99}$ for $df_{num} = n_1 - 1 = 6 - 1 = 5$ and
$df_{den} = n_2 - 1 = 6 - 1 = 5$ are employed. The appropriate values derived from Table A10 are
$F_{.95} = 5.05$ and $F_{.99} = 10.97$. Since the obtained value $F_{max} = 33$ is greater than both of the
aforementioned tabled critical values, the directional alternative hypothesis $H_1: \sigma_1^2 > \sigma_2^2$ is
supported at both the .05 and .01 levels.
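
The parametric analysis above can be retraced with the following sketch. It assumes that the variance estimates entering the $F_{max}$ ratio are the unbiased estimates (sum of squares divided by n - 1), which is consistent with the values 26.4 and .8 implied by the sums reported for Example 14.1. The function name `unbiased_variance` is hypothetical, and the critical values are those quoted from Tables A9 and A10.

```python
def unbiased_variance(scores):
    """Variance estimate computed from the sum of scores and sum of squared scores."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x_squared = sum(x * x for x in scores)
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)


group1 = [10, 10, 9, 1, 0, 0]
group2 = [6, 6, 5, 5, 4, 4]

var1 = unbiased_variance(group1)
var2 = unbiased_variance(group2)
f_max = max(var1, var2) / min(var1, var2)   # larger variance over smaller variance

nondirectional = {".05": 7.15, ".01": 14.9}   # Table A9 values quoted in the text
directional = {".05": 5.05, ".01": 10.97}     # Table A10 values quoted in the text
supported_nondirectional = {level: f_max > value for level, value in nondirectional.items()}
supported_directional = {level: f_max > value for level, value in directional.items()}
print(round(var1, 2), round(var2, 2), round(f_max, 2))
```

The ratio of roughly 33 exceeds every tabled critical value by a wide margin, which is the point made in the discussion of relative power.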

Note that the difference between the computed $F_{max}$ (or F) value and the appropriate tabled
critical value is more pronounced in the case of Hartley's $F_{max}$ test for homogeneity of
variance/F test for two population variances than the difference between the computed test
statistic and the appropriate tabled critical value for the Siegel-Tukey test for equal variability
(when either the exact U distribution or the normal approximation is employed). The actual
probability associated with the outcome of the analysis is, in fact, less than .01 for both the $F_{max}$
and Siegel-Tukey tests, but is even further removed from .01 in the case of the $F_{max}$ test. This
latter observation is consistent with the fact that when both a parametric and nonparametric test 
are applied to the same set of data, the former test will generally provide a more powerful test 
of an alternative hypothesis. 

The above outcome is consistent with the fact that various sources (e.g., Marascuilo and 
McSweeney (1977) and Siegel and Castellan (1988)) note that the asymptotic relative efficiency 
(discussed in Section VII of the Wilcoxon signed-ranks test (Test 6)) of the Siegel-Tukey test 
relative to the $F_{max}$ test is only .61. However, the asymptotic relative efficiency of the
Siegel-Tukey test may be considerably higher when the underlying population distributions are not
normal. 


2. Alternative nonparametric tests of dispersion In Section I it is noted that the Siegel-Tukey
test for equal variability is one of a number of nonparametric tests for ordinal data that
have been developed for evaluating the hypothesis that two populations have equal variances. The
determination with respect to which of these tests to employ is generally based on the specific
assumptions a researcher is willing to make about the underlying population distributions. Other
factors that can determine which test a researcher elects to employ are the relative power
efficiencies of the tests under consideration, and the complexity of the computations required
for a specific test. This section will briefly summarize a few of the alternative procedures that
evaluate the same hypothesis as the Siegel-Tukey test for equal variability. One or more of
these procedures are described in detail in various books which specialize in nonparametric
statistics (e.g., Conover (1980, 1999), Daniel (1990), Hollander and Wolfe (1999),
Marascuilo and McSweeney (1977), Siegel and Castellan (1988), and Sprent (1993)). In 
addition, Sheskin (1984) provides a general overview and bibliography of nonparametric tests 
of dispersion. 

The Ansari-Bradley test (Ansari and Bradley (1960) and Freund and Ansari (1957)) 
evaluates the same hypothesis as the Siegel-Tukey test for equal variability, as well as sharing 
its assumptions. The Moses test for equal variability (Test 15) (Moses, 1963), which is 
described in the next chapter, can also be employed to evaluate the same hypothesis. However, 
the Moses test is more computationally involved than the two aforementioned tests. Unlike the 
Siegel-Tukey test and Ansari-Bradley test, the Moses test assumes that the data evaluated 
represent at least interval level measurement. In addition, the Moses test does not assume that 
the two populations have equal medians. Among other nonparametric tests of dispersion are 
procedures developed by Conover (Conover and Iman (1978), Conover (1980, 1999)), Klotz 
(1962), and Mood (1954). Of the tests just noted, the Siegel-Tukey test for equal variability,
the Klotz test, and the Mood test can be extended to designs involving more than two
independent samples. In addition to all of the aforementioned procedures, tests of extreme
reactions developed by Moses (1952) (the Moses test of extreme reactions is described in
Siegel (1956)) and Hollander (1963) can be employed to contrast the variability of two
independent groups. Since there is extensive literature on nonparametric tests of dispersion, the
interested reader should consult sources that specialize in nonparametric statistics for a more 
comprehensive discussion of the subject. 


VIII. Additional Examples Illustrating the Siegel-Tukey Test 
for Equal Variability 


The Siegel-Tukey test for equal variability can be employed to evaluate the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$
with any of the examples noted for the t test for two independent samples (Test 11) and the
Mann-Whitney U test. In order to employ the Siegel-Tukey test for equal variability with 
any of the aforementioned examples, the data must be rank-ordered employing the protocol 
described in Section IV. Example 14.3 is an additional example that can be evaluated with the 
Siegel-Tukey test for equal variability. It is characterized by the fact that unlike Examples 
14.1 and 14.2, in Example 14.3 subjects are rank-ordered without initially obtaining scores that 
represent interval/ratio level measurement. Although it is implied that the ranks in Example 14.3 
are based on an underlying interval/ratio scale, the data are never expressed in such a format. 


Example 14.3 A company determines that there is no difference with respect to enthusiasm for 
a specific product after people are exposed to a monochromatic versus a polychromatic 
advertisement for the product. The company, however, wants to determine whether different 
degrees of variability are associated with the two types of advertisement. To answer the 
question, a study is conducted employing twelve subjects who as a result of having no knowledge 
of the product are neutral towards it. Six of the subjects are exposed to a monochromatic
advertisement for the product (Group 1), and the other six are exposed to a polychromatic version of
the same advertisement (Group 2). One week later each subject is interviewed by a market
researcher who is blind with respect to the advertisement to which a subject was exposed. Upon
interviewing all 12 subjects, the market researcher rank-orders them with respect to their level
of enthusiasm for the product. The rank-orders of the subjects in the two groups follow (assume 
that the lower the rank-order, the lower the level of enthusiasm for the product): 


Group 1: Subject 1,1: 12; Subject 2,1: 2; Subject 3,1: 4; Subject 4,1: 6;
Subject 5,1: 3; Subject 6,1: 10

Group 2: Subject 1,2: 7; Subject 2,2: 5; Subject 3,2: 9; Subject 4,2: 8;
Subject 5,2: 11; Subject 6,2: 1


Is there a significant difference in the degree of variability within each of the groups? 


Employing the ranking protocol for the Siegel-Tukey test for equal variability with the
above data, the ranks of the two groups are converted into the following new set of ranks (i.e.,
assigning a rank of 1 to the lowest rank, a rank of 2 to the highest rank, a rank of 3 to the second
highest rank, etc.).


Group 1: Subject 1,1: 2; Subject 2,1: 4; Subject 3,1: 8; Subject 4,1: 12;
Subject 5,1: 5; Subject 6,1: 6

Group 2: Subject 1,2: 11; Subject 2,2: 9; Subject 3,2: 7; Subject 4,2: 10;
Subject 5,2: 3; Subject 6,2: 1


Employing the above set of ranks, $\Sigma R_1 = 37$ and $\Sigma R_2 = 41$. Through use of Equations
12.1 and 12.2, the values of $U_1$ and $U_2$ are computed to be $U_1 = (6)(6) + [[6(6 + 1)]/2] - 37
= 20$ and $U_2 = (6)(6) + [[6(6 + 1)]/2] - 41 = 16$. Since $U_2 = 16$ is less than $U_1 = 20$,
U = 16. In Table A11, for $n_1 = 6$ and $n_2 = 6$, the tabled critical two-tailed .05 and .01
values are $U_{.05} = 5$ and $U_{.01} = 2$, and the tabled critical one-tailed .05 and .01 values are
$U_{.05} = 7$ and $U_{.01} = 3$. Since U = 16 is greater than all of the aforementioned critical values,
the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ cannot be rejected, regardless of whether a nondirectional or
directional alternative hypothesis is employed. Thus, there is no evidence to indicate that the two
types of advertisements result in different degrees of variability.
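
Because the data in Example 14.3 are already ranks from 1 to 12 with no ties, the conversion to Siegel-Tukey ranks reduces to a lookup table. The sketch below builds that table and recomputes the U values. The function name `siegel_tukey_sequence` is hypothetical, and the alternating protocol (one rank from the low end, then pairs from the high and low ends in turn) is assumed from the description above.

```python
def siegel_tukey_sequence(n):
    """Map each ordered position 1..n to its Siegel-Tukey rank (no ties)."""
    new_rank = {}
    low, high, take, from_low = 1, n, 1, True
    rank = 1
    while low <= high:
        for _ in range(take):
            if low > high:
                break
            if from_low:
                new_rank[low] = rank
                low += 1
            else:
                new_rank[high] = rank
                high -= 1
            rank += 1
        from_low, take = not from_low, 2
    return new_rank


monochromatic = [12, 2, 4, 6, 3, 10]   # Group 1 enthusiasm rank-orders
polychromatic = [7, 5, 9, 8, 11, 1]    # Group 2 enthusiasm rank-orders

lookup = siegel_tukey_sequence(len(monochromatic) + len(polychromatic))
ranks1 = [lookup[r] for r in monochromatic]
ranks2 = [lookup[r] for r in polychromatic]

n1, n2 = len(ranks1), len(ranks2)
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - sum(ranks1)
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - sum(ranks2)
u = min(u1, u2)
print(ranks1, ranks2, u1, u2, u)
```

The converted ranks match those listed for the two groups, and the obtained U of 16 is well above the tabled critical values.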
Endnotes


1. As is the case with the Mann-Whitney U test, if the reverse ranking protocol is employed,
the values of $U_1$ and $U_2$ are reversed. Since the value of U, which represents the test
statistic, is the lower of the two values $U_1$ versus $U_2$, the value designated U with the reverse
ranking protocol will be the same U value obtained with the original ranking protocol.


2. As is the case with the Mann-Whitney U test, in describing the Siegel-Tukey test for
equal variability some sources do not compute a U value, but rather provide tables which 
are based on the smaller and/or larger of the two sums of ranks. The equation for the normal 
approximation (to be discussed in Section VI) in these sources is also based on the sums of 
the ranks. 


3. As previously noted, we can instead add three points to each score in Group 1. 


4. If one employs Equation 11.7, and thus uses Table A10, the same tabled critical values
are listed for $F_{.975}$ and $F_{.995}$ for $df_{num} = 6 - 1 = 5$ and $df_{den} = 6 - 1 = 5$. Thus,
$F_{.975} = 7.15$ and $F_{.995} = 14.94$. (The latter value is only listed to one decimal place in
Table A9.) The use of Table A10 in evaluating homogeneity of variance is discussed in
Section VI of the t test for two independent samples.
Test 15


The Moses Test for Equal Variability 
(Nonparametric Test Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Do two independent samples represent two populations with 
different variances? 


Relevant background information on test Developed by Moses (1963), the Moses test for 
equal variability is a nonparametric procedure that can be employed in a hypothesis testing 
situation involving two independent samples. If the result of the Moses test for equal 
variability is significant, it indicates there is a significant difference between the sample 
variances, and as a result of the latter the researcher can conclude there is a high likelihood that 
the samples represent populations with different variances. 

The Moses test for equal variability is one of a number of tests of dispersion (also 
referred to as tests of scale or spread) that have been developed for contrasting the variances of 
two independent samples. A discussion of alternative nonparametric tests of dispersion can be 
found in Section VII. Some sources recommend the use of nonparametric tests of dispersion for 
evaluating the homogeneity of variance hypothesis when there is reason to believe that the 
normality assumption of the appropriate parametric test for evaluating the same hypothesis is 
violated. Sources that are not favorably disposed toward nonparametric tests recommend the use 
of Hartley's $F_{max}$ test for homogeneity of variance/F test for two population variances
(Test 11a) (or one of the alternative parametric tests that are available for evaluating homogeneity 
of variance), regardless of whether or not the normality assumption of the parametric test is 
violated. Such sources do, however, recommend that in employing a parametric test, a researcher 
employ a lower significance level to compensate for the fact that violation of the normality 
assumption can inflate the Type I error rate associated with the test. When there is no evidence 
to indicate that the normality assumption of the parametric test has been violated, sources are in 
general agreement that such a test is preferable to the Moses test for equal variability (or an 
alternative nonparametric test of dispersion), since a parametric test (which uses more information
than a nonparametric test) provides a more powerful test of an alternative hypothesis.

Since nonparametric tests are not assumption free, the choice of which of the available tests 
of dispersion to employ will primarily depend on what assumptions a researcher is willing to 
make with regard to the underlying distributions represented by the sample data. The Moses test 
for equal variability is based on the following assumptions: a) Each sample has been randomly 
selected from the population it represents; b) The two samples are independent of one another; 
c) The original scores obtained for each of the subjects are in the format of interval/ratio data, 
and the dependent variable is a continuous variable. (A continuous variable is characterized 
by the fact that a given score can assume any value within the range of values that define the 
limits of that variable.); and d) The underlying populations from which the samples are derived 
are similar in shape. 
It is important to note that a major difference between the Moses test for equal variability
and the previously discussed Siegel-Tukey test for equal variability (Test 14) is that the Moses 
test does not assume that the two populations from which the samples are derived have equal 
medians (which is an assumption underlying the Siegel-Tukey test). 

It should be noted that all of the other tests in this text that rank data (with the exception of 
the Wilcoxon signed-ranks test (Test 6) and the Wilcoxon matched-pairs signed-ranks test 
(Test 18)) rank the original interval/ratio scores of subjects. The Moses test for equal
variability, however, does not rank the original interval/ratio scores, but instead ranks the sums of
squared difference/deviation scores. For this reason, some sources (e.g., Siegel and Castellan 
(1988)) categorize the Moses test for equal variability as a test of interval/ratio data. In this 
book, however, the Moses test for equal variability is categorized as a test of ordinal data, by 
virtue of the fact that a ranking procedure constitutes a critical part of the test protocol. 


II. Example 


Example 15.1 is identical to Example 14.1 (which is evaluated with the Siegel-Tukey test for
equal variability). Although Example 15.1 suggests that the underlying population medians are 
equal, as noted above, the latter is not an assumption of the Moses test for equal variability. 


Example 15.1 In order to assess the effect of two antidepressant drugs, 12 clinically depressed
patients are randomly assigned to one of two groups. Six patients are assigned to Group 1,
which is administered the antidepressant drug Elatrix for a period of six months. The other six
patients are assigned to Group 2, which is administered the antidepressant drug Euphyria during
the same six-month period. Assume that prior to introducing the experimental treatments, the 
experimenter confirmed that the level of depression in the two groups was equal. After six 
months elapse, all 12 subjects are rated by a psychiatrist (who is blind with respect to a subject's 
experimental condition) on their level of depression. The psychiatrist's depression ratings for 
the six subjects in each group follow (the higher the rating, the more depressed a subject): 
Group 1: 10, 10, 9, 1, 0, 0; Group 2: 6, 6, 5, 5, 4, 4. 

The fact that the mean and median of each group are equivalent (specifically, both values 
equal 5) is consistent with prior research which suggests that there is no difference in efficacy 
for the two drugs (when the latter is based on a comparison of group means and/or medians). 
Inspection of the data does suggest, however, that there is much greater variability in the 
depression scores of subjects in Group 1. To be more specific, the data suggest that the drug 
Elatrix may, in fact, decrease depression in some subjects, yet increase it in others. The
researcher decides to contrast the variability within the two groups through use of the Moses test
for equal variability. The use of the latter nonparametric test is predicated on the fact that
there is reason to believe that the distributions of the posttreatment depression scores in the
underlying populations are not normal (which is why the researcher is reluctant to evaluate the
data with Hartley's $F_{max}$ test for homogeneity of variance/F test for two population variances).
Do the data indicate there is a significant difference between the variances of the two groups? 


III. Null versus Alternative Hypotheses 
The test statistic for the Moses test for equal variability is computed with the Mann-Whitney
U test (Test 12). In order to understand the full text of the null and alternative hypotheses
presented in this section, the reader will have to read the protocol involved in conducting the Moses
test for equal variability, which is described in Section IV.

Null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$


(The variance of the population Group 1 represents equals the variance of the population Group 
2 represents. When, within the framework of the Mann-Whitney U test analysis to be 
described, both groups have an equal sample size, this translates into the sum of the ranks of the 
sums of the squared difference scores of Group 1 being equal to the sum of the ranks of the sums 
of the squared difference scores of Group 2. A more general way of stating this, which also 
encompasses designs involving unequal sample sizes, is that the means of the ranks of the sums 
of the squared difference scores of the two groups are equal.) 


Alternative hypothesis $H_1: \sigma_1^2 \neq \sigma_2^2$


(The variance of the population Group 1 represents does not equal the variance of the population 
Group 2 represents. When, within the framework of the Mann-Whitney U test analysis to be 
described, both groups have an equal sample size, this translates into the sum of the ranks of the 
sums of the squared difference scores of Group 1 not being equal to the sum of the ranks of the 
sums of the squared difference scores of Group 2. A more general way of stating this, which also 
encompasses designs involving unequal sample sizes, is that the means of the ranks of the sums 
of the squared difference scores of the two groups are not equal. This is a nondirectional 
alternative hypothesis and it is evaluated with a two-tailed test.) 


or 


$H_1: \sigma_1^2 > \sigma_2^2$ 


(The variance of the population Group 1 represents is greater than the variance of the population 
Group 2 represents. When, within the framework of the Mann-Whitney U test analysis to be 
described, both groups have an equal sample size (so long as a rank of 1 is given to the lowest 
score), this translates into the sum of the ranks of the sums of the squared difference scores of 
Group 1 being greater than the sum of the ranks of the sums of the squared difference scores of 
Group 2. A more general way of stating this, which also encompasses designs involving unequal 
sample sizes, is that the mean of the ranks of the sums of the squared difference scores of Group 
1 is greater than the mean of the ranks of the sums of the squared difference scores of Group 2. 
This is a directional alternative hypothesis and it is evaluated with a one-tailed test.) 


or 


$H_1: \sigma_1^2 < \sigma_2^2$ 


(The variance of the population Group 1 represents is less than the variance of the population 
Group 2 represents. When, within the framework of the Mann-Whitney U test analysis to be 
described, both groups have an equal sample size (so long as a rank of 1 is given to the lowest 
score), this translates into the sum of the ranks of the sums of the squared difference scores of 
Group 1 being less than the sum of the ranks of the sums of the squared difference scores of 
Group 2. A more general way of stating this, which also encompasses designs involving unequal 
sample sizes, is that the mean of the ranks of the sums of the squared difference scores of Group 
1 is less than the mean of the ranks of the sums of the squared difference scores of Group 2. This 
is a directional alternative hypothesis and it is evaluated with a one-tailed test.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 


The protocol described below is employed for computing the test statistic for the Moses test for 
equal variability. In employing the protocol the following values are applicable: a) The total 
number of subjects employed in the experiment is N = 12; and b) There are $n_1 = 6$ subjects in 
Group 1 and $n_2 = 6$ subjects in Group 2. 

a) The protocol for the Moses test for equal variability requires that the original interval/ 
ratio scores be broken down into subsamples. A subsample is a set of scores derived from a 
sample, with the number of scores in a subsample being less than the total number of scores in 
the sample. 

b) Divide the $n_1$ scores in Group 1 into $m_1$ subsamples (where $m_1 > 1$), with each 
subsample being comprised of k scores. Selection of the k scores for each of the $m_1$ subsamples 
should be random. Sampling without replacement (which is defined in Endnote 1 of the 
binomial sign test for a single sample (Test 9)) is employed in forming the subsamples. In 
other words, each of the $n_1$ scores in Group 1 is employed in only one of the $m_1$ subsamples. 

c) Divide the $n_2$ scores in Group 2 into $m_2$ subsamples (where $m_2 > 1$), with each 
subsample being comprised of k scores. Selection of the k scores for each of the $m_2$ 
subsamples should be random. Sampling without replacement is employed in forming the 
subsamples. In other words, each of the $n_2$ scores in Group 2 is employed in only one of the 
$m_2$ subsamples.¹ 

d) Note that regardless of which group a subsample is derived from, all subsamples will be 
comprised of the same number of scores (i.e., k scores). The number of subsamples derived from 
each group, however, need not be equal. In other words, the values of $m_1$ and $m_2$ do not have 
to be equivalent. The number of scores in each subsample should be such that the products 
$(m_1)(k)$ and $(m_2)(k)$ include as many of the scores as possible. Although the optimal situation 
would be if $(m_1)(k) = n_1$ and $(m_2)(k) = n_2$, it will often not be possible to achieve the 
latter (i.e., include all N scores in the $m_1 + m_2$ subsamples). 

To illustrate the formation of subsamples, let us assume that $n_1 = 20$ and $n_2 = 20$. 
Employing the data from Group 1, we can form $m_1 = 4$ subsamples comprised of k = 5 scores 
per subsample. Thus, each of the $n_1 = 20$ scores in Group 1 will be included in one of the 
subsamples. Employing the data from Group 2, we can form $m_2 = 4$ subsamples comprised of 
k = 5 scores per subsample. Thus, each of the $n_2 = 20$ scores in Group 2 will be included in one 
of the subsamples. Now let us assume that in Group 1 there are only 18 subjects (i.e., 
$n_1 = 18$ and $n_2 = 20$). If we still employ k = 5 scores per subsample, we can only form 
$m_1 = 3$ subsamples (which includes 15 of the 18 scores in Group 1). In such a case, three scores 
in Group 1 will have to be omitted from the analysis (which will employ $m_1 = 3$ subsamples 
comprised of k = 5 scores per subsample, and $m_2 = 4$ subsamples comprised of k = 5 scores per 
subsample). In order to include more subjects in the total analysis with $n_1 = 18$ and $n_2 = 20$, 
we can employ k = 4 scores per subsample, in which case we will have $m_1 = 4$ subsamples 
comprised of k = 4 scores per subsample, and $m_2 = 5$ subsamples comprised of k = 4 scores per 
subsample. In the latter case, only two scores in Group 1 will have to be omitted from the 
analysis. Obviously, if $n_1 = 18$ and $n_2 = 20$, we can only include all N = 38 subjects in the 
analysis if we have k = 2 scores per subsample. In such a case we will have $m_1 = 9$ subsamples 
comprised of k = 2 scores per subsample, and $m_2 = 10$ subsamples comprised of k = 2 scores 
per subsample. 
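
The arithmetic in the preceding illustration can be sketched in a few lines of Python. The function name `subsample_plan` is a hypothetical helper introduced here for illustration only; it assumes that the number of subsamples is the whole-number quotient of the group size divided by k, and that the remainder is the number of scores omitted from the analysis.

```python
def subsample_plan(n1, n2, k):
    """Return (m1, m2, omitted1, omitted2) for subsamples of k scores each.

    m1 and m2 are the numbers of complete subsamples that can be formed,
    and omitted1 and omitted2 are the scores left out of each group.
    """
    m1, omitted1 = divmod(n1, k)
    m2, omitted2 = divmod(n2, k)
    return m1, m2, omitted1, omitted2


# The three scenarios discussed in the text for n1 = 18 and n2 = 20.
for k in (5, 4, 2):
    m1, m2, o1, o2 = subsample_plan(18, 20, k)
    print(f"k={k}: m1={m1}, m2={m2}, omitted from Group 1={o1}, from Group 2={o2}")
```

Running the loop gives three omitted scores for k = 5, two for k = 4, and none for k = 2, which agrees with the values stated above.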

Daniel (1990) notes that Shorack (1969) recommends the following criteria in determining 
the values of k, $m_1$, and $m_2$: 1) k should be as large as possible, but not more than 10; and 
2) The values of $m_1$ and $m_2$ should be large enough to derive meaningful results. The latter 
translates into employing values for $m_1$ and $m_2$ that meet the minimum sample size 
requirements for the Mann-Whitney U test, which is employed to compute the test statistic for 
the Moses test for equal variability. This latter point will be clarified in Section V. 

d) Compute the mean of each of the $m_1$ subsamples derived from Group 1. Within each 
subsample do the following: 1) Subtract the mean of the subsample from each of the k scores in 
the subsample; 2) Square each of the difference scores; and 3) Obtain the sum of the k squared 
difference scores. The notation $\Sigma D_{i1}^2$ will represent the sum of the squared difference 
scores for the i-th subsample in Group 1. There will be a total of $m_1$ $\Sigma D_{i1}^2$ scores for 
Group 1. 

e) Compute the mean of each of the $m_2$ subsamples derived from Group 2. Within each 
subsample do the following: 1) Subtract the mean of the subsample from each of the k scores 
in the subsample; 2) Square each of the difference scores; and 3) Obtain the sum of the k squared 
difference scores. The notation $\Sigma D_{i2}^2$ will represent the sum of the squared difference 
scores for the i-th subsample in Group 2. There will be a total of $m_2$ $\Sigma D_{i2}^2$ scores for 
Group 2. 

f) The reader may want to review the protocol for the Mann-Whitney U test (in Section 
IV of the latter test) prior to continuing this section, since at this point in the analysis the 
Mann-Whitney U test is employed to compute the test statistic for the Moses test for equal 
variability. Specifically, within the framework of the Mann-Whitney U test model, each of the 
$m_1$ sums of the squared difference scores in Group 1 (i.e., the $m_1$ $\Sigma D_{i1}^2$ scores) is 
conceptualized as one of the $n_1$ scores in Group 1, and each of the $m_2$ sums of the squared 
difference scores in Group 2 (i.e., the $m_2$ $\Sigma D_{i2}^2$ scores) is conceptualized as one of the 
$n_2$ scores in Group 2. 

If the null hypothesis is true, it is expected that the rank orders for the sums of the squared 
difference scores in Groups 1 and 2 will be evenly dispersed and, consequently, the sum of the 
ranks for the Group 1 $\Sigma D_{i1}^2$ scores will be equal or close to the sum of the ranks for the 
Group 2 $\Sigma D_{i2}^2$ scores. If, on the other hand, there is greater variability in one of the 
groups, the rank-orderings of the sums of the squared difference scores for that group will be 
higher than the rank-orderings of the sums of the squared difference scores for the other group. 
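
The subsampling and sum-of-squares steps of the protocol can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the seed value is arbitrary, and shuffling the scores and then cutting the shuffled list into consecutive blocks of k is assumed to be an acceptable way of sampling without replacement. The six scores are those that appear in the Group 1 subsamples of Table 15.1.

```python
import random


def form_subsamples(scores, k, rng):
    """Randomly split scores into subsamples of k scores each.

    Each score is used in at most one subsample (sampling without
    replacement). Any leftover scores (fewer than k) are omitted.
    """
    shuffled = list(scores)
    rng.shuffle(shuffled)
    m = len(shuffled) // k  # number of complete subsamples
    return [shuffled[i * k:(i + 1) * k] for i in range(m)]


def sum_squared_deviations(subsample):
    """Sum of squared differences between each score and the subsample mean."""
    mean = sum(subsample) / len(subsample)
    return sum((x - mean) ** 2 for x in subsample)


rng = random.Random(2000)          # arbitrary seed so the split is reproducible
group1 = [1, 10, 10, 0, 9, 0]      # Group 1 scores, as listed in Table 15.1
subsamples = form_subsamples(group1, 2, rng)
sums = [sum_squared_deviations(s) for s in subsamples]
print(subsamples, sums)
```

Because the split is random, the subsamples printed by this sketch will generally differ from those in Table 15.1. Note also that with floating-point means, sums that should be tied may differ in the last digits; exact fractions avoid this when k does not divide the subsample totals evenly.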

The protocol described in this section for the Moses test for equal variability will now 
be employed to evaluate Example 15.1. Table 15.1 summarizes the initial part of the analysis, 
which requires that subsamples be selected from each of the groups. As a result of the small 
sample size employed in the study, each subsample will be comprised of k = 2 scores, and thus 
there will be $m_1 = 3$ subsamples for Group 1 and $m_2 = 3$ subsamples for Group 2. Since 
$(m_1)(k) = (3)(2) = 6 = n_1$ and $(m_2)(k) = (3)(2) = 6 = n_2$, the scores of all N = 12 subjects 
are employed in the analysis. As noted earlier, the assignment of scores to subsamples is random. 
In this instance the author used a table of random numbers to select the scores for each of the 
subsamples.² 


Table 15.1 Summary of Analysis of Example 15.1 


Group 1 
Subsample    $\bar{X}$    $(X - \bar{X})$    $(X - \bar{X})^2$    $\Sigma(X - \bar{X})^2 = \Sigma D_{i1}^2$ 
1) 1, 10     5.5          -4.5, 4.5          20.25, 20.25         40.5 
2) 10, 0     5            5, -5              25, 25               50 
3) 9, 0      4.5          4.5, -4.5          20.25, 20.25         40.5 

Group 2 
Subsample    $\bar{X}$    $(X - \bar{X})$    $(X - \bar{X})^2$    $\Sigma(X - \bar{X})^2 = \Sigma D_{i2}^2$ 
1) 4, 4      4            0, 0               0, 0                 0 
2) 5, 6      5.5          -.5, .5            .25, .25             .5 
3) 5, 6      5.5          -.5, .5            .25, .25             .5 


Table 15.2 Analysis of Example 15.1 with Mann-Whitney U Test 


        Group 1                          Group 2 
$\Sigma D_{i1}^2$    Rank        $\Sigma D_{i2}^2$    Rank 
40.5                 4.5         0                    1 
50                   6           .5                   2.5 
40.5                 4.5         .5                   2.5 
$\Sigma R_1 = 15$                $\Sigma R_2 = 6$ 

\[ U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - \Sigma R_1 = (3)(3) + \frac{3(3 + 1)}{2} - 15 = 0 \]

\[ U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - \Sigma R_2 = (3)(3) + \frac{3(3 + 1)}{2} - 6 = 9 \]


The information in Table 15.1 can be summarized as follows: a) Column 1 lists the 
two scores in each subsample. Each row contains a separate subsample; b) Column 2 lists 
the mean ($\bar{X}$) of each subsample; c) Column 3 lists the difference scores ($(X - \bar{X})$) 
obtained when the mean of a subsample is subtracted from each score in the subsample; 
d) Column 4 lists the squared difference scores ($(X - \bar{X})^2$) for each subsample (which are 
the squares of the scores in Column 3); and e) Column 5 lists the sum of the squared difference 
scores ($\Sigma(X - \bar{X})^2 = \Sigma D_{ij}^2$) for each of the subsamples (i.e., within each 
row/subsample, the value in Column 5 is the sum of the values in Column 4). Note that the 
notation $\Sigma D_{ij}^2$ represents the sum of the squared difference scores of the i-th subsample 
in Group j. The values in Column 5 are evaluated in Table 15.2 with the Mann-Whitney U test. 
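
The ranking and the two U values in Table 15.2 can be reproduced with the short sketch below. The function names are hypothetical; the code assumes, as in the Mann-Whitney protocol, that a rank of 1 is assigned to the lowest value and that tied values receive the average of the ranks they occupy.

```python
def average_ranks(values):
    """Rank values from lowest (rank 1) to highest.

    Tied values receive the average of the ranks they jointly occupy.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + 1 + end + 1) / 2
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def mann_whitney_u(sums1, sums2):
    """Return (sum of ranks 1, sum of ranks 2, U1, U2) for two lists of values."""
    n1, n2 = len(sums1), len(sums2)
    ranks = average_ranks(list(sums1) + list(sums2))
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return r1, r2, u1, u2


# Column 5 of Table 15.1: sums of squared difference scores per subsample.
r1, r2, u1, u2 = mann_whitney_u([40.5, 50, 40.5], [0, 0.5, 0.5])
print(r1, r2, u1, u2)  # 15.0 6.0 0.0 9.0
```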


V. Interpretation of the Test Results 


The smaller of the two values $U_1$ versus $U_2$ is designated as the obtained U statistic. Since 
$U_1 = 0$ is smaller than $U_2 = 9$, the value of U = 0. The value of U is evaluated with Table 
A11 (Table of Critical Values for the Mann-Whitney U Statistic) in the Appendix.³ In 
the case of Example 15.1, there are three scores in each group (which are the sums of the squared 
difference scores for the three subsamples that comprise that group). Thus, $n_1 = 3$ and 
$n_2 = 3$. Because of the small sample size, Table A11 does not list critical two-tailed .05 and 
.01 values, nor does it list a critical one-tailed .01 value. It does, however, list the critical 
one-tailed .05 value $U_{.05} = 0$. In order to be significant, the obtained value of U must be equal 
to or less than the tabled critical value at the prespecified level of significance. Since U = 0 is 
equal to $U_{.05} = 0$, the directional alternative hypothesis that is consistent with the data is 
supported at the .05 level. The latter alternative hypothesis is $H_1: \sigma_1^2 > \sigma_2^2$. 

In Section III it is noted that when both groups have an equal sample size, if the directional 
alternative hypothesis $H_1: \sigma_1^2 > \sigma_2^2$ is supported, the sum/average of the ranks of 
the sums of the squared difference scores of Group 1 will be greater than the sum/average of the 
ranks of the sums of the squared difference scores of Group 2 (i.e., $\Sigma R_1 > \Sigma R_2$). Since 
the latter is the case in Example 15.1, the directional alternative hypothesis 
$H_1: \sigma_1^2 > \sigma_2^2$ is supported at the .05 level. 

For the directional alternative hypothesis $H_1: \sigma_1^2 < \sigma_2^2$ to be supported, the 
sum/average of the ranks of the sums of the squared difference scores of Group 2 must be greater 
than the sum/average of the ranks of the sums of the squared difference scores of Group 1 (i.e., 
$\Sigma R_1 < \Sigma R_2$). Since, as noted above, the opposite is true, the directional alternative 
hypothesis $H_1: \sigma_1^2 < \sigma_2^2$ is not supported. 

For the nondirectional alternative hypothesis $H_1: \sigma_1^2 \neq \sigma_2^2$ to be supported, the 
sum/average of the ranks of the sums of the squared difference scores of Group 1 must not be 
equal to the sum/average of the ranks of the sums of the squared difference scores of Group 2 
(i.e., $\Sigma R_1 \neq \Sigma R_2$). In point of fact, the latter is true, and the computed value U = 0 
is the smallest possible U value that can be obtained. However, as noted earlier, because of the 
small sample size, no two-tailed .05 and .01 critical values are listed in Table A11 for $n_1 = 3$ 
and $n_2 = 3$. 
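
Why a table of exact critical values can list a one-tailed .05 value but no two-tailed value for two groups of three can be illustrated by enumerating the null distribution directly. The sketch below is an illustration under stated assumptions: the function name is hypothetical, no ties are assumed, and under the null hypothesis every assignment of $n_1$ of the N ranks to Group 1 is treated as equally likely.

```python
from itertools import combinations
from math import comb


def prob_u1_at_most(u_max, n1, n2):
    """One-tailed P(U1 <= u_max) under the null hypothesis, assuming no ties.

    Every choice of n1 of the n1 + n2 ranks for Group 1 is equally likely.
    """
    total = n1 + n2
    count = 0
    for ranks1 in combinations(range(1, total + 1), n1):
        u1 = n1 * n2 + n1 * (n1 + 1) / 2 - sum(ranks1)
        if u1 <= u_max:
            count += 1
    return count / comb(total, n1)


p_one_tailed = prob_u1_at_most(0, 3, 3)
print(p_one_tailed)      # 0.05, i.e., 1 of the 20 possible rank assignments
print(2 * p_one_tailed)  # 0.1, the two-tailed probability by symmetry
```

Under these assumptions the most extreme possible outcome has a one-tailed probability of exactly .05, so it can reach the one-tailed .05 level but not the two-tailed .05 level or either .01 level.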

Based on the results of the Moses test for equal variability, the researcher can conclude 
that there is greater variability in the depression scores of the group that receives the drug Elatrix 
(Group 1) than the group that receives the drug Euphyria (Group 2). 

When the same data are evaluated with the Siegel-Tukey test for equal variability, as well 
as with Hartley's $F_{max}$ test for homogeneity of variance/F test for two population variances, 
both the nondirectional alternative hypothesis $H_1: \sigma_1^2 \neq \sigma_2^2$ and the directional 
alternative hypothesis $H_1: \sigma_1^2 > \sigma_2^2$ are supported at both the .05 and .01 levels. The 
fact that the latter two alternative procedures for evaluating the null hypothesis 
$H_0: \sigma_1^2 = \sigma_2^2$ yield a more significant result than the Moses test is consistent with 
the fact that of the three procedures, the one with the lowest statistical power is the Moses test 
(the issue of the power of the Moses test for equal variability is discussed in greater detail in 
Section VII). 


VI. Additional Analytical Procedures for the Moses Test for 
Equal Variability and/or Related Tests 


1. The normal approximation of the Moses test statistic for large sample sizes  Although 
the sample size is too small to employ the large sample normal approximation of the 
Mann-Whitney U test statistic, for demonstration purposes the latter value is computed below 
with Equation 12.4. As noted in Section VI of the Mann-Whitney U test, the large sample 
normal approximation is generally used for sample sizes larger than those documented in the 
exact table for the Mann-Whitney U test contained within the source one is employing. 
Employing Equation 12.4, the absolute value z = 1.96 is computed. 

The obtained absolute value z = 1.96 is evaluated with Table A1 (Table of the Normal 
Distribution) in the Appendix. In order to be significant, the obtained absolute value of z 
must be equal to or greater than the tabled critical value at the prespecified level of significance. 
The tabled critical two-tailed .05 and .01 values are $z_{.05} = 1.96$ and $z_{.01} = 2.58$, and the 
tabled critical one-tailed .05 and .01 values are $z_{.05} = 1.65$ and $z_{.01} = 2.33$. 

\[ z = \frac{U - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} = \frac{0 - \frac{(3)(3)}{2}}{\sqrt{\frac{(3)(3)(3 + 3 + 1)}{12}}} = -1.96 \]


The following guidelines are employed in evaluating the null hypothesis. 

a) If the nondirectional alternative hypothesis $H_1: \sigma_1^2 \neq \sigma_2^2$ is employed, the null hypothesis 
can be rejected if the obtained absolute value of z is equal to or greater than the tabled critical 
two-tailed value at the prespecified level of significance. 

b) If a directional alternative hypothesis is employed, one of the two possible directional 
alternative hypotheses is supported if the obtained absolute value of z is equal to or greater than 
the tabled critical one-tailed value at the prespecified level of significance. The directional 
alternative hypothesis that is supported is the one that is consistent with the data. 

Since the computed absolute value z = 1.96 is equal to $z_{.05} = 1.96$ but less than 
$z_{.01} = 2.58$, the nondirectional alternative hypothesis $H_1: \sigma_1^2 \neq \sigma_2^2$ is 
supported at the .05 level, but not at the .01 level. Since the computed absolute value z = 1.96 
is greater than $z_{.05} = 1.65$ but less than $z_{.01} = 2.33$, the directional alternative hypothesis 
$H_1: \sigma_1^2 > \sigma_2^2$ is supported at the .05 level, but not at the .01 level. 

The continuity-corrected Equation 12.5 yields the slightly lower absolute value z = 1.75. 
Since the computed absolute value z = 1.75 is greater than $z_{.05} = 1.65$ but less than 
$z_{.01} = 2.33$, the directional alternative hypothesis $H_1: \sigma_1^2 > \sigma_2^2$ is supported, but 
only at the .05 level. Since the computed absolute value z = 1.75 is less than $z_{.05} = 1.96$ and 
$z_{.01} = 2.58$, the nondirectional alternative hypothesis $H_1: \sigma_1^2 \neq \sigma_2^2$ is not 
supported. Thus, the result obtained with the continuity-corrected equation is the same as the 
result obtained when the values in Table A11 are employed. 

\[ z = \frac{\left| U - \frac{n_1 n_2}{2} \right| - .5}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} = \frac{\left| 0 - \frac{(3)(3)}{2} \right| - .5}{\sqrt{\frac{(3)(3)(3 + 3 + 1)}{12}}} = 1.75 \]

VII. Additional Discussion of the Moses Test for Equal 
Variability 


1. Power-efficiency of the Moses test for equal variability  Daniel (1990) and Siegel and 
Castellan (1988) note that the power-efficiency of the Moses test for equal variability relative 
to a parametric procedure such as Hartley's $F_{max}$ test for homogeneity of variance/F test for 
two population variances is a function of the size of the subsamples. With small subsamples 
the asymptotic relative efficiency (which is discussed in Section VII of the Wilcoxon 
signed-ranks test) of the test is relatively low if the underlying population distributions are 
normal (e.g., the power efficiency is .50 when k = 3). Although as the value of k increases, the 
power efficiency approaches an upper limit of .95, the downside of employing a large number of 
scores in each subsample is that as k increases, the number of subsamples that are available for 
analysis will decrease (and the latter will compromise the power of the Mann-Whitney analysis 
employed to compute the Moses test statistic). Thus, in deciding whether to employ the Moses 
test for equal variability, the researcher must weigh the test's relatively low power efficiency 
against the following factors: a) The extreme sensitivity of Hartley's $F_{max}$ test for 
homogeneity of variance/F test for two population variances to violations of the assumption of 
normality in the underlying populations (which is not an assumption of the Moses test); and 
b) The assumption of equal population medians associated with the Siegel-Tukey test for equal 
variability. Since the power of a statistical test is directly related to sample size, for the same 
set of data, the Siegel-Tukey test for equal variability will have higher power than the Moses 
test for equal variability. The latter is true, since the sample size for the test statistic with the 
Siegel-Tukey test will always be the values $n_1$ and $n_2$, while with the Moses test the sample 
size will be $m_1$ and $m_2$ (the values of which will always be less than $n_1$ and $n_2$). 


2. Issue of repetitive resampling  An obvious problem associated with the Moses test for 
equal variability is that its result is dependent on the configuration of the data in each of the 
random subsamples employed in the analysis. It is entirely possible that an analysis based on one 
set of subsamples may yield a different result than an analysis based on a different set of 
subsamples. Because of the latter, if a researcher has a bias in favor of obtaining a significant (or 
perhaps nonsignificant) result, she can continue to select subsamples until she obtains a set of 
subsamples that yields the desired result. Obviously, the latter protocol is inappropriate and 
would compromise the integrity of one's results. Put simply, the Moses test for equal 
variability should be run one time, with the researcher accepting the resulting outcome. If the 
researcher has reason to believe the outcome does not reflect the truth regarding the populations 
in question, a replication study should be conducted. 
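
The dependence of the result on the particular subsamples drawn can be made concrete by enumerating every way of splitting each group of Example 15.1 into three pairs (k = 2) and recording the U value each split produces. This is a sketch for illustration only: the function names are hypothetical, the scores are those appearing in Table 15.1, and U is computed by the pair-counting form (the number of Group 1 values below Group 2 values, with ties counted as one half), which is equivalent to the rank-sum formulas used in the text.

```python
def pairings(items):
    """Yield every way to split an even-length list into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def sum_squared_deviations(subsample):
    """Sum of squared differences from the subsample mean."""
    mean = sum(subsample) / len(subsample)
    return sum((x - mean) ** 2 for x in subsample)


def u_statistic(sums1, sums2):
    """Smaller of U1 and U2, computed by counting pairs (ties count one half)."""
    u1 = sum(1.0 if a < b else 0.5 if a == b else 0.0
             for a in sums1 for b in sums2)
    u2 = len(sums1) * len(sums2) - u1
    return min(u1, u2)


group1 = [1, 10, 10, 0, 9, 0]
group2 = [4, 4, 5, 6, 5, 6]

distinct_u = set()
for split1 in pairings(group1):
    sums1 = [sum_squared_deviations(pair) for pair in split1]
    for split2 in pairings(group2):
        sums2 = [sum_squared_deviations(pair) for pair in split2]
        distinct_u.add(u_statistic(sums1, sums2))

print(sorted(distinct_u))  # more than one value of U is possible
```

The split used in Table 15.1 yields U = 0, but other splits of the same twelve scores yield larger values of U. This is why the subsamples should be drawn once, before the outcome is known.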


3. Alternative nonparametric tests of dispersion  In Section I it is noted that the Moses test 
for equal variability is one of a number of nonparametric tests that have been developed for 
evaluating the hypothesis that two populations have equal variances. The determination with 
respect to which of these tests to employ is generally based on the specific assumptions a 
researcher is willing to make about the underlying population distributions. Other factors that 
can determine which test a researcher elects to employ are the relative power efficiencies of the 
tests under consideration, and the complexity of the computations required for a specific test. 
As noted in Section VI of the Siegel-Tukey test for equal variability, among the other 
procedures that are available for evaluating a hypothesis about equality of population variances 
are the Ansari-Bradley test (Ansari and Bradley (1960) and Freund and Ansari (1957)), and 
nonparametric tests of dispersion developed by Conover (Conover and Iman (1978), Conover 
(1980, 1999)), Hollander (1963), Klotz (1962), Mood (1954), and Moses (1952). Of the 
aforementioned tests, the Siegel-Tukey test for equal variability, the Klotz test, and the Mood 
test can be extended to designs involving more than two independent samples. 

Since there is extensive literature on nonparametric tests of dispersion, the interested reader 
should consult sources that specialize in nonparametric statistics for a more comprehensive 
discussion of the subject. One or more of these procedures are described in detail in various 
books that specialize in nonparametric statistics (e.g., Conover (1980, 1999), Daniel (1990), 
Hollander and Wolfe (1999), Marascuilo and McSweeney (1977), Siegel and Castellan (1988), 
and Sprent (1993)). In addition, Sheskin (1984) provides a general overview and bibliography 
of nonparametric tests of dispersion. 


VIII. Additional Examples Illustrating the Moses Test for 
Equal Variability 


In the discussion of the Siegel-Tukey test for equal variability, Example 14.2 was employed 
to illustrate an example in which a researcher is not able to assume that the two population 
medians are equal. Although the latter is a situation where the Moses test for equal variability 
would be more appropriate to employ than the Siegel-Tukey test (since the Moses test doesn't 
assume equal population medians), because of the small sample size ($n_1 = 5$ and $n_2 = 6$) the 
Moses test cannot be employed. In the case of Example 14.2, the only way to obtain more than 
one subsample per group is to set k = 2, and have $m_1 = 2$ subsamples in Group 1 (for which the 
five scores are 7, 5, 4, 4, 3), and $m_2 = 3$ subsamples in Group 2 (for which the six scores are 13, 
12, 7, 7, 4, 3). However, since there are no critical values listed in the Mann-Whitney table 
(i.e., Table A11) for $n_1 = 2$ and $n_2 = 3$ (which are the number of subsamples/sums of squared 
difference scores in each group), the probability level associated with the result of the 
Mann-Whitney U test will always be above .05, regardless of whether a one-tailed or two-tailed 
analysis is employed. 
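
The reason no critical value can exist for two subsamples versus three can be shown with a short counting argument. The worked arithmetic below is a sketch that assumes no ties among the five sums of squared difference scores and that, under the null hypothesis, each of the ways of assigning two of the five ranks to Group 1 is equally likely. The most extreme outcome (U = 0) then has the following probability:

```latex
P(U = 0 \mid n_1 = 2,\ n_2 = 3) = \frac{1}{\binom{5}{2}} = \frac{1}{10} = .10 \quad \text{(one-tailed)},
\qquad 2(.10) = .20 \quad \text{(two-tailed)}
```

Since even the most extreme possible outcome has a probability of .10, no obtainable value of U can be significant at the .05 level.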

Example 15.2 is an additional problem that will be evaluated with the Moses test for equal 
variability. 


Example 15.2 A researcher wants to determine whether or not a group of subjects who are 
given a low dose of a stimulant drug exhibit more variability with respect to the number of errors 
they make on a test of eye-hand coordination than a group of subjects who are given a placebo. 
There are $n_1 = 12$ subjects in the group administered the drug and $n_2 = 17$ subjects in the 
placebo group. The scores of the N = 29 subjects are listed below. 


Group 1: 8, 5, 4, 3, 2, 9, 6, 1, 14, 18, 8, 8 
Group 2: 7, 7, 7, 8, 9, 7, 8, 9, 8, 8, 7, 10, 11, 12, 7, 9, 5 

Is there a significant difference between the degree of variability within each of the groups? 


In evaluating the data with the Moses test for equal variability we will employ k = 3 
subjects per subsample, which means that we will derive $m_1 = 4$ subsamples for Group 1 (since 
$n_1 = 12$ divided by 3 equals 4), and $m_2 = 5$ subsamples for Group 2 (since $n_2 = 17$ divided 
by 3 equals 5, with a remainder of 2). Since $n_2 = 17$ is not evenly divisible by 3, two of the 
scores in Group 2 will not be included in the Group 2 subsamples (specifically, the score of one 
of the six subjects who obtained a 7 and the score of the subject who obtained a 5 were not 
selected for inclusion in the Group 2 subsamples during the random selection process). Tables 
15.3 and 15.4 summarize the analysis. In Table 15.4, within the framework of the 
Mann-Whitney U test, the values $n_1 = 4$ and $n_2 = 5$ are employed to represent the $m_1 = 4$ 
sums of squared difference scores for Group 1 and the $m_2 = 5$ sums of squared difference scores 
for Group 2. 


Table 15.3 Summary of Analysis of Example 15.2 


Group 1 
Subsample      $\bar{X}$    $(X - \bar{X})$        $(X - \bar{X})^2$     $\Sigma(X - \bar{X})^2 = \Sigma D_{i1}^2$ 
1) 5, 6, 4     5            0, 1, -1               0, 1, 1               2 
2) 8, 18, 1    9            -1, 9, -8              1, 81, 64             146 
3) 8, 14, 9    10.33        -2.33, 3.67, -1.33     5.43, 13.47, 1.77     20.67 
4) 3, 8, 2     4.33         -1.33, 3.67, -2.33     1.77, 13.47, 5.43     20.67 

Group 2 
Subsample      $\bar{X}$    $(X - \bar{X})$        $(X - \bar{X})^2$     $\Sigma(X - \bar{X})^2 = \Sigma D_{i2}^2$ 
1) 12, 7, 7    8.67         3.33, -1.67, -1.67     11.09, 2.79, 2.79     16.67 
2) 9, 9, 8     8.67         .33, .33, -.67         .11, .11, .45         .67 
3) 10, 11, 7   9.33         .67, 1.67, -2.33       .45, 2.79, 5.43       8.67 
4) 8, 8, 7     7.67         .33, .33, -.67         .11, .11, .45         .67 
5) 8, 7, 9     8            0, -1, 1               0, 1, 1               2 


Since $U_1 = 2.5$ is smaller than $U_2 = 17.5$, the value of U = 2.5. Employing Table A11, 
for the values $n_1 = 4$ and $n_2 = 5$, the tabled critical two-tailed .05 value is $U_{.05} = 1$. 
Because of the small sample size, no tabled critical two-tailed .01 value is listed. The tabled 
critical one-tailed values are $U_{.05} = 2$ and $U_{.01} = 0$. Since the computed value U = 2.5 is 
larger than all of the tabled critical values, the null hypothesis cannot be rejected, regardless of 
which alternative hypothesis is employed. Thus, the researcher cannot conclude that there are 
differences in the variances of the two groups. 

The data for Example 15.2 are consistent with the directional alternative hypothesis 
$H_1: \sigma_1^2 > \sigma_2^2$, since the average of the ranks for Group 1 ($\bar{R}_1 = 6.875$) is 
larger than the average of the ranks for Group 2 ($\bar{R}_2 = 3.5$). Support for the latter 
directional alternative hypothesis falls just short of being significant, since U = 2.5 is only .5 
units above the tabled critical one-tailed value $U_{.05} = 2$. 
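
The rank sums, mean ranks, and U values reported for Example 15.2 can be checked from the subsamples listed in Table 15.3. The sketch below is illustrative: the function names are hypothetical, and exact fractions are used (an implementation choice made here, not something prescribed by the text) so that sums of squared difference scores that are tied on paper, such as the two values of 20.67, are also tied in the computation.

```python
from fractions import Fraction


def sum_squared_deviations(subsample):
    """Exact sum of squared differences from the subsample mean."""
    mean = Fraction(sum(subsample), len(subsample))
    return sum((x - mean) ** 2 for x in subsample)


def average_ranks(values):
    """Rank 1 goes to the lowest value; tied values share the average rank."""
    ranks = []
    for v in values:
        less = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)  # includes v itself
        ranks.append(less + (equal + 1) / 2)
    return ranks


# Subsamples as listed in Table 15.3.
subsamples1 = [[5, 6, 4], [8, 18, 1], [8, 14, 9], [3, 8, 2]]
subsamples2 = [[12, 7, 7], [9, 9, 8], [10, 11, 7], [8, 8, 7], [8, 7, 9]]

sums1 = [sum_squared_deviations(s) for s in subsamples1]
sums2 = [sum_squared_deviations(s) for s in subsamples2]
n1, n2 = len(sums1), len(sums2)

ranks = average_ranks(sums1 + sums2)
r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2

print(r1, r2)            # 27.5 17.5
print(u1, u2)            # 2.5 17.5
print(r1 / n1, r2 / n2)  # 6.875 3.5
```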


Table 15.4 Analysis of Example 15.2 with Mann-Whitney U Test 


        Group 1                          Group 2 
$\Sigma D_{i1}^2$    Rank        $\Sigma D_{i2}^2$    Rank 
2                    3.5         16.67                6 
146                  9           .67                  1.5 
20.67                7.5         8.67                 5 
20.67                7.5         .67                  1.5 
                                 2                    3.5 
$\Sigma R_1 = 27.5$              $\Sigma R_2 = 17.5$ 

\[ U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - \Sigma R_1 = (4)(5) + \frac{4(4 + 1)}{2} - 27.5 = 2.5 \]

\[ U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - \Sigma R_2 = (4)(5) + \frac{5(5 + 1)}{2} - 17.5 = 17.5 \]


Table 15.5 Analysis of Example 15.2 with Siegel-Tukey Test for Equal Variability 


     Group 1                   Group 2 
$X_1$      Rank          $X_2$      Rank 
8          24.86         7          20.5 
5          10.5          7          20.5 
4          8             7          20.5 
3          5             8          24.86 
2          4             9          14.5 
9          14.5          7          20.5 
6          13            8          24.86 
1          1             9          14.5 
14         3             8          24.86 
18         2             8          24.86 
8          24.86         7          20.5 
8          24.86         10         10 
                         11         7 
                         12         6 
                         7          20.5 
                         9          14.5 
                         5          10.5 
$\Sigma R_1 = 135.58$    $\Sigma R_2 = 299.44$ 

\[ U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - \Sigma R_1 = (12)(17) + \frac{12(12 + 1)}{2} - 135.58 = 146.42 \]

\[ U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - \Sigma R_2 = (12)(17) + \frac{17(17 + 1)}{2} - 299.44 = 57.56 \]

It turns out that when the data for Example 15.2 are evaluated with the Siegel-Tukey test 
for equal variability, the computed value of the test statistic (through use of the Mann-Whitney 
U test) is U = 57.56.⁴ Table 15.5 summarizes the analysis with the Siegel-Tukey test. In the case 
of the Siegel-Tukey test, the original sample size values $n_1 = 12$ and $n_2 = 17$ are employed in 
obtaining critical values from Table A11. For $n_1 = 12$ and $n_2 = 17$, the tabled critical 
two-tailed .05 and .01 values are $U_{.05} = 57$ and $U_{.01} = 44$, and the tabled critical one-tailed 
.05 and .01 values are $U_{.05} = 64$ and $U_{.01} = 49$. Since U = 57.56 is greater (albeit barely) 
than the tabled critical two-tailed value $U_{.05} = 57$, the nondirectional alternative hypothesis 
$H_1: \sigma_1^2 \neq \sigma_2^2$ is not supported. The directional alternative hypothesis 
$H_1: \sigma_1^2 > \sigma_2^2$ is supported, but only at the .05 level, since U = 57.56 is less than the 
tabled critical one-tailed value $U_{.05} = 64$ (but greater than the tabled critical one-tailed value 
$U_{.01} = 49$). Thus, when the Siegel-Tukey test for equal variability is employed, the researcher 
can conclude that there is greater variability in the scores of Group 1. 

When the data for Example 15.2 are evaluated with Hartley's $F_{max}$ test for homogeneity 
of variance/F test for two population variances, the nondirectional alternative hypothesis 
$H_1: \sigma_1^2 \neq \sigma_2^2$ and the directional alternative hypothesis 
$H_1: \sigma_1^2 > \sigma_2^2$ are supported at both the .05 and .01 levels. The computations for the 
latter test are shown below. Equation 11.6 is employed to compute the value $F_{max} = 8.39$. 


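\[ \tilde{s}_1^2 = \frac{\Sigma X_1^2 - \frac{(\Sigma X_1)^2}{n_1}}{n_1 - 1} = \frac{884 - \frac{(86)^2}{12}}{12 - 1} = 24.33 \]

\[ \tilde{s}_2^2 = \frac{\Sigma X_2^2 - \frac{(\Sigma X_2)^2}{n_2}}{n_2 - 1} = \frac{1183 - \frac{(139)^2}{17}}{17 - 1} = 2.90 \]

\[ F_{max} = \frac{\tilde{s}_L^2}{\tilde{s}_S^2} = \frac{24.33}{2.90} = 8.39 \]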

The computed value $F_{max} = 8.39$ is evaluated with Table A9 (Table of the $F_{max}$ 
Distribution) in the Appendix. The tabled critical values for the $F_{max}$ distribution are listed in 
reference to the values (n - 1) and k, where n represents the number of subjects per group, and 
k represents the number of groups. In the case of Example 15.2, the computed value 
$F_{max} = 8.39$ is larger than the tabled critical values in Table A9 for k = 2 and n = 12. (With 
unequal sample sizes, for the most conservative test of the null hypothesis, we employ the smaller 
of the two sample size values $n_1 = 12$ and $n_2 = 17$.) The tabled critical values for k = 2 and 
n = 12 are $F_{max_{.05}} = 3.28$ and $F_{max_{.01}} = 4.91$. Since the obtained value 
$F_{max} = 8.39$ is larger than both of the aforementioned critical values, we can reject the null 
hypothesis at both the .05 and .01 levels. Thus, the nondirectional alternative hypothesis 
$H_1: \sigma_1^2 \neq \sigma_2^2$ is supported. If the latter nondirectional alternative hypothesis is 
supported at both the .05 and .01 levels, the directional alternative hypothesis 
$H_1: \sigma_1^2 > \sigma_2^2$ will also be supported at both the .05 and .01 levels, since the critical 
values for the latter alternative hypothesis will be lower than the critical two-tailed values noted 
above. Thus, when Hartley's $F_{max}$ test for homogeneity of variance/F test for two population 
variances is employed, the researcher can conclude that there is a higher degree of variability in 
the scores of Group 1. As noted earlier, the latter test will generally provide a more powerful test 
of an alternative hypothesis than either the Moses test (which does not yield a significant result 
for Example 15.2) or the Siegel-Tukey test (which does yield a significant result for Example 
15.2, but only for a one-tailed alternative hypothesis). 
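
The two variance estimates and their ratio can be checked with the sketch below. It is an illustration under stated assumptions: the function name is hypothetical, and the Group 1 and Group 2 scores are taken as they are listed in Table 15.5. The unrounded ratio is about 8.38; the reported value of 8.39 results from dividing the two variances after each has been rounded to two decimal places.

```python
def unbiased_variance(scores):
    """Unbiased estimate of the population variance (n - 1 in the denominator)."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x_squared = sum(x * x for x in scores)
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)


# Scores as listed in Table 15.5.
group1 = [8, 5, 4, 3, 2, 9, 6, 1, 14, 18, 8, 8]
group2 = [7, 7, 7, 8, 9, 7, 8, 9, 8, 8, 7, 10, 11, 12, 7, 9, 5]

var1 = unbiased_variance(group1)
var2 = unbiased_variance(group2)

# F_max is the larger variance estimate divided by the smaller one.
f_max = max(var1, var2) / min(var1, var2)
f_max_rounded_inputs = round(var1, 2) / round(var2, 2)

print(round(var1, 2), round(var2, 2))  # 24.33 2.9
print(round(f_max_rounded_inputs, 2))  # 8.39
```

Either way the ratio is well above the tabled critical values of 3.28 and 4.91 cited in the text, so the conclusion does not depend on the rounding.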


Endnotes 


1. One could argue that the use of random subsamples for the Moses test for equal variability 
allows one to conceptualize the test within the framework of the general category of 
resampling procedures, which are discussed in Section IX (the Addendum) of the 
Mann-Whitney U test. 


2. A typical table of random numbers is a computer generated series of random digits that
fall within the range 0 through 9. If there are six scores in a group, the sequential appear-
ance of the digits 1 through 6 in the table can be used to form subsamples. For example,
let us assume that the following string of digits appears in a random number table:
2352239455675900912937373949404. We will form three subsamples, with two scores per
subsample. Since the first digit that appears in the random number table is 2, the second score
listed for the group will be the first score assigned to Subsample 1. Since 3 is the next digit,
the third score listed becomes the second score in Subsample 1. Since 5 is the next digit, the
fifth score listed becomes the first score in Subsample 2. We ignore the next four digits
(2, 2, 3 and 9) since: a) We have already selected the second and third scores from the group;
and b) The digit 9 indicates that we should select the ninth score. The latter score, however,
does not exist, since there are only six scores in the group. Since the next digit is 4, the
fourth score in the group becomes the second score in Subsample 2. By default, the two
scores that remain in the group (the first and sixth scores) will comprise Subsample 3.
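
The selection procedure in Endnote 2 can be sketched in Python. This is only an illustration under stated assumptions: the function name form_subsamples and its parameters are hypothetical, the group is assumed to contain at most nine scores (so that a single digit can identify a score), and the group size is assumed to be evenly divisible by the subsample size.

```python
def form_subsamples(digits, n_scores, subsample_size):
    """Assign score positions (1-based) to subsamples from a string of random digits."""
    chosen = []
    # The final subsample is filled "by default" with whatever positions remain.
    n_to_draw = n_scores - subsample_size
    for ch in digits:
        position = int(ch)
        # Ignore 0, digits larger than the group size, and positions already selected.
        if 1 <= position <= n_scores and position not in chosen:
            chosen.append(position)
        if len(chosen) == n_to_draw:
            break
    remaining = [p for p in range(1, n_scores + 1) if p not in chosen]
    ordered = chosen + remaining
    # Cut the ordered positions into consecutive subsamples of equal size.
    return [ordered[i:i + subsample_size] for i in range(0, n_scores, subsample_size)]


subsamples = form_subsamples("2352239455675900912937373949404", n_scores=6, subsample_size=2)
print(subsamples)  # [[2, 3], [5, 4], [1, 6]]
```

The printed positions match the endnote: the second and third scores form Subsample 1, the fifth and fourth scores form Subsample 2, and the first and sixth scores form Subsample 3.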


3. The large sample normal approximation for the Mann-Whitney U test (i.e., Equations 12.4
or 12.5) can be employed when the values of $m_1$ and $m_2$ used to represent $n_1$ and $n_2$ are
such that one or both of the values is larger than the largest tabled value in Table A11.
Equation 12.6 (the tie-corrected Mann-Whitney normal approximation equation) can be
employed if ties are present in the data (i.e., there are one or more identical values for squared
difference score sums).


4. For purposes of illustration we will assume that the medians of the populations the two 
groups represent are equal (which is an assumption of the Siegel-Tukey test for equal 
variability). In actuality, the sample medians computed for Groups 1 and 2 are respectively 
7 and 8. 


5. When $n_1 \ne n_2$, the smaller sample size is employed when using Table A9 in order to
minimize the likelihood of committing a Type I error. 




Test 16 
The Chi-Square Test for r x c Tables 


(Nonparametric Test Employed with Categorical/Nominal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test In the underlying population(s) represented by the sample(s) 
in a contingency table, are the observed cell frequencies different from the expected frequencies? 


Relevant background information on test The chi-square test for r x c tables is one of a 
number of tests described in this book for which the chi-square distribution is the appropriate 
sampling distribution.¹ The chi-square test for r x c tables is an extension of the chi-square
goodness-of-fit test (Test 8) to two-dimensional tables. Whereas the latter test can only be em- 
ployed with a single sample categorized on a single dimension (the single dimension is repre- 
sented by the k cells/categories that comprise the frequency distribution table), the chi-square 
test for r x c tables can be employed to evaluate designs that summarize categorical data in the 
form of an r x c table (which is often referred to as a contingency table). An r x c table consists 
of r rows and c columns. Both the values of r and c are integer numbers that are equal to or 
greater than 2. The total number of cells in an r x c table is obtained by multiplying the value 
of r by the value of c. The data contained in each of the cells of a contingency table represent 
the number of observations (i.e., subjects or objects) that are categorized in the cell. 

Table 16.1 presents the general model for an r x c contingency table. There are a total of 
n observations in the table. Note that each cell is identified with a subscript that consists of two 
elements. The first element identifies the row in which the cell falls and the second element 
identifies the column in which the cell falls. Thus, the notation $O_{ij}$ represents the number of
observations in the cell that is in the i-th row and the j-th column. $O_{i.}$ represents the number of
observations in the i-th row and $O_{.j}$ represents the number of observations in the j-th column.


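Table 16.1 General Model for an r x c Contingency Table


                                          Column variable                             Row sums
                       $C_1$      $C_2$      ...   $C_j$      ...   $C_c$
               $R_1$   $O_{11}$   $O_{12}$   ...   $O_{1j}$   ...   $O_{1c}$   $O_{1.}$
               $R_2$   $O_{21}$   $O_{22}$   ...   $O_{2j}$   ...   $O_{2c}$   $O_{2.}$
Row variable   ...     ...        ...        ...   ...        ...   ...        ...
               $R_i$   $O_{i1}$   $O_{i2}$   ...   $O_{ij}$   ...   $O_{ic}$   $O_{i.}$
               ...     ...        ...        ...   ...        ...   ...        ...
               $R_r$   $O_{r1}$   $O_{r2}$   ...   $O_{rj}$   ...   $O_{rc}$   $O_{r.}$
Column sums            $O_{.1}$   $O_{.2}$   ...   $O_{.j}$   ...   $O_{.c}$   n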
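
The marginal notation of Table 16.1 can be illustrated with a short Python sketch. The function name marginal_sums is hypothetical, and the 2 x 3 table of counts below is made up purely for illustration (it does not come from any example in the text).

```python
def marginal_sums(table):
    """Return (row sums, column sums, n) for an r x c table given as a list of rows."""
    row_sums = [sum(row) for row in table]          # O_i. for each row i
    col_sums = [sum(col) for col in zip(*table)]    # O_.j for each column j
    n = sum(row_sums)                               # total number of observations
    return row_sums, col_sums, n


# Hypothetical 2 x 3 table (r = 2 rows, c = 3 columns); the counts are invented.
made_up_table = [[10, 20, 30],
                 [5, 15, 20]]
row_sums, col_sums, n = marginal_sums(made_up_table)
print(row_sums, col_sums, n)  # [60, 40] [15, 35, 50] 100
```
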

In actuality, there are two chi-square tests that can be conducted with an r x c table. The 
two tests that will be described are the chi-square test for homogeneity (Test 16a) and the 
chi-square test of independence (Test 16b). The general label chi-square test for r x c tables 
will be employed to refer to both of the aforementioned tests, since the two tests are compu- 
tationally identical. Although, in actuality, the chi-square test for homogeneity and the chi- 
square test of independence evaluate different hypotheses, a generic hypothesis can be stated 
that is applicable to both tests. A brief description of the two tests follows. 

The chi-square test for homogeneity (Test 16a) The chi-square test for homogeneity 
is employed when r independent samples (where r ≥ 2) are categorized on a single dimension
which consists of c categories (where c ≥ 2). The data for the r independent samples (which are
generally represented by the r rows of the contingency table) are recorded with reference to the 
number of observations in each of the samples that fall within each of c categories (which are 
generally represented by the c columns of the contingency table). It is assumed that each of the 
samples is randomly drawn from the underlying population it represents. The chi-square test 
for homogeneity evaluates whether or not the r samples are homogeneous with respect to the 
proportion of observations in each of the c categories. To be more specific, if the data are 
homogeneous, the proportion of observations in the j-th category will be equal in all of the r
populations. The chi-square test for homogeneity assumes that the sums of the r rows (which 
represent the number of observations in each of the r samples) are determined by the researcher 
prior to the data collection phase of a study. Example 16.1 in Section II is employed to illustrate 
the chi-square test for homogeneity. 

The chi-square test of independence (Test 16b) The chi-square test of independence 
is employed when a single sample is categorized on two dimensions/variables. It is assumed that 
the sample is randomly selected from the population it represents. One of the dimensions/ 
variables is comprised of r categories (where r ≥ 2) that are represented by the r rows of the
contingency table, while the second dimension/variable is comprised of c categories (where
c ≥ 2) that are represented by the c columns of the contingency table. The chi-square test of
independence evaluates the general hypothesis that the two variables are independent of one 
another. Another way of stating that two variables are independent of one another is to say that 
there is a zero correlation between them. A zero correlation indicates there is no way to predict 
at above chance in which category an observation will fall on one of the variables, if it is known 
which category the observation falls on the second variable. (For an overview of the concept of 
correlation, the reader should consult Section I of the Pearson product-moment correlation 
coefficient (Test 28).) The chi-square test of independence assumes that neither the sums of 
the r rows (which represent the number of observations in each of the r categories for Variable 
1) nor the sums of the c columns (which represent the number of observations in each of the c
categories for Variable 2) are predetermined by the researcher prior to the data collection phase 
of a study. Example 16.2 in Section II is employed to illustrate the chi-square test of inde- 
pendence. 

The chi-square test for r x c tables (i.e., both the chi-square test for homogeneity and 
the chi-square test of independence) is based on the following assumptions: a) Categorical/ 
nominal data (i.e., frequencies) for r x c mutually exclusive categories are employed in the 
analysis; b) The data that are evaluated represent a random sample comprised of n independent 
observations. This assumption reflects the fact that each subject or object can only be 
represented once in the data; and c) The expected frequency of each cell in the contingency table 
is 5 or greater. When the expected frequency of one or more cells is less than 5, the probabilities 
in the chi-square distribution may not provide an accurate estimate of the underlying sampling 
distribution. As is the case for the chi-square goodness-of-fit test, sources are not in agreement 
with respect to the minimum acceptable value for an expected frequency. Many sources employ 
criteria suggested by Cochran (1952), who stated that none of the expected frequencies should 
be less than 1, and that no more than 20% of the expected frequencies should be less than 5. 
However, many sources suggest the latter criteria may be overly conservative. In instances 
where a researcher believes that one or more expected cell frequencies are too small, two or more 
cells can be combined with one another to increase the values of the expected frequencies. 
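
The two criteria attributed to Cochran (1952) above can be checked mechanically. The following Python sketch is only an illustration; the function name meets_cochran_criteria is hypothetical, and the table of expected frequencies is assumed to be supplied as a list of rows.

```python
def meets_cochran_criteria(expected):
    """Check that no expected frequency is below 1 and at most 20% are below 5."""
    cells = [e for row in expected for e in row]
    none_below_one = all(e >= 1 for e in cells)
    share_below_five = sum(e < 5 for e in cells) / len(cells)
    return none_below_one and share_below_five <= 0.20


# Expected frequencies of 45 and 55 (the values derived later for Examples 16.1 and 16.2).
print(meets_cochran_criteria([[45, 55], [45, 55]]))  # True
```

When the function returns False, the text's suggestion of combining cells is one way to raise the expected frequencies before the test is run.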

In actuality the chi-square distribution only provides an approximation of the exact sampling 
distribution for a contingency table.² The accuracy of the chi-square approximation increases as
the size of the sample increases and, except for instances involving small sample sizes, the chi- 
square distribution provides an excellent approximation of the exact sampling distribution. One 
case for which an exact probability is often computed is a 2 x 2 contingency table involving a 
small sample size. In the latter instance, an exact probability can be computed through use of the 
hypergeometric distribution. The computation of an exact probability for a 2 x 2 table using the 
hypergeometric distribution is described under the Fisher exact test (Test 16c) in Section VI. 
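
As a rough illustration of what an exact hypergeometric probability looks like for a 2 x 2 table with fixed marginal sums, the Python sketch below computes the probability of one specific table. The function name and the cell counts are hypothetical, and the sketch is not a full Fisher exact test, which is described in Section VI and requires summing such probabilities over all tables at least as extreme as the observed one.

```python
from math import comb


def hypergeometric_table_probability(a, b, c, d):
    """Probability of the 2 x 2 table [[a, b], [c, d]] given its row and column sums."""
    n = a + b + c + d
    # Ways to fill column 1 from row 1 and from row 2, over all ways to fill column 1.
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


# Invented small-sample table: rows sum to 4 and 4, columns sum to 4 and 4.
probability = hypergeometric_table_probability(1, 3, 3, 1)
print(probability)
```

For the invented table the probability is (4)(4)/70 = 16/70.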


II. Examples 


Example 16.1 A researcher conducts a study in order to evaluate the effect of noise on 
altruistic behavior. Each of the 200 subjects who participate in the experiment is randomly 
assigned to one of two experimental conditions. Subjects in both conditions are given a one-hour 
test which is ostensibly a measure of intelligence. During the test the 100 subjects in Group 1 
are exposed to continual loud noise, which they are told is due to a malfunctioning generator. 
The 100 subjects in Group 2 are not exposed to any noise during the test. Upon completion of 
this stage of the experiment, each subject on leaving the room is confronted by a middle-aged 
man whose arm is in a sling. The man asks the subject if she would be willing to help him carry 
a heavy package to his car. In actuality, the man requesting help is an experimental confederate 
(i.e., working for the experimenter). The number of subjects in each group who help the man is 
recorded. Thirty of the 100 subjects who were exposed to noise elect to help the man, while 60 
of the 100 subjects who were not exposed to noise elect to help the man. Do the data indicate 
that altruistic behavior is influenced by noise? 


The data for Example 16.1, which can be summarized in the form of a 2 x 2 contingency 
table, are presented in Table 16.2. 


Table 16.2 Summary of Data for Example 16.1


               Helped             Did not help
               the confederate    the confederate    Row sums
Noise                30                 70              100
No noise             60                 40              100
Column sums          90                110              Total observations = 200


The appropriate test to employ for evaluating Example 16.1 is the chi-square test for 
homogeneity. This is the case, since the design of the study involves the use of categorical data 
(i.e., frequencies for each of the r x c cells in the contingency table) with multiple independent 
samples (specifically two) that are categorized on a single dimension (altruism). To be more 
specific, the differential treatments to which the two groups are exposed (i.e., noise versus 
no-noise) constitute the independent variable. The latter variable is the row variable, since it is 
represented by the two rows in Table 16.2. Note that the researcher assigns 100 subjects to each 
of the two levels of the independent variable prior to the data collection phase of the study. This 
is consistent with the fact that when the chi-square test for homogeneity is employed, the sums 
for the row variable are predetermined prior to collecting the data. The dependent variable is 
whether or not a subject exhibits altruistic behavior. The latter variable is represented by the two 
categories helped the confederate versus did not help the confederate. The dependent variable 
is the column variable, since it is represented by the two columns in Table 16.2. The hypothesis 
that is evaluated with the chi-square test for homogeneity is whether there is a difference be- 
tween the two groups with respect to the proportion of subjects who help the confederate. 


Example 16.2 A researcher wants to determine if there is a relationship between the 
personality dimension of introversion-extroversion and political affiliation. Two hundred people 
are recruited to participate in the study. All of the subjects are given a personality test on the 
basis of which each subject is classified as an introvert or an extrovert. Each subject is then 
asked to indicate whether he or she is a Democrat or a Republican. The data for Example 16.2, 
which can be summarized in the form of a 2 x 2 contingency table, are presented in Table 16.3. 
Do the data indicate there is a significant relationship between one's political affiliation and 
whether or not one is an introvert versus an extrovert? 


Table 16.3 Summary of Data for Example 16.2


               Democrat    Republican    Row sums
Introvert         30           70          100
Extrovert         60           40          100
Column sums       90          110          Total observations = 200


The appropriate test to employ for evaluating Example 16.2 is the chi-square test of 
independence. This is the case since: a) The study involves a single sample that is categorized 
on two dimensions; and b) The data are comprised of frequencies for each of the r x c cells in 
the contingency table. To be more specific, a sample of 200 subjects is categorized on the 
following two dimensions, with each dimension being comprised of two mutually exclusive 
categories: a) introvert versus extrovert; and b) Democrat versus Republican. In Example 
16.2 the introvert-extrovert dimension is the row variable and the Democrat-Republican 
dimension is the column variable. Note that in selecting the sample of 200 subjects, the 
researcher does not determine beforehand the number of introverts, extroverts, Democrats, and 
Republicans to include in the study.^ Thus, in Example 16.2 (consistent with the use of the chi- 
square test of independence) the sums of the rows and columns (which are referred to as the 
marginal sums) are not predetermined. The hypothesis that is evaluated with the chi-square 
test of independence is whether the two dimensions are independent of one another. 


III. Null versus Alternative Hypotheses 


Even though the hypotheses evaluated with the chi-square test for homogeneity and the chi- 
square test of independence are not identical, generic null and alternative hypotheses employing 
common symbolic notation can be used for both tests. The generic null and alternative hypoth- 
eses employ the observed and expected cell frequencies in the underlying population(s) rep- 
resented by the sample(s). The observed and expected cell frequencies for the population(s) are 
represented respectively by the lower case Greek letters omicron (ο) and epsilon (ε). Thus,
$o_{ij}$ and $\varepsilon_{ij}$ respectively represent the observed and expected frequency of Cell$_{ij}$ in the underlying
population.


Null hypothesis $H_0\colon o_{ij} = \varepsilon_{ij}$ for all cells


(This notation indicates that in the underlying population(s) the sample(s) represent(s), for each 
of the r x c cells the observed frequency of a cell is equal to the expected frequency of the cell. 
With respect to the sample data, this translates into the observed frequency of each of the r x c 
cells being equal to the expected frequency of the cell.) 


Alternative hypothesis $H_1\colon o_{ij} \ne \varepsilon_{ij}$ for at least one cell


(This notation indicates that in the underlying population(s) the sample(s) represent(s), for at 
least one of the r x c cells the observed frequency of a cell is not equal to the expected frequency 
of the cell. With respect to the sample data, this translates into the observed frequency of at least 
one of the r x c cells not being equal to the expected frequency of the cell. This notation should 
not be interpreted as meaning that in order to reject the null hypothesis there must be a 
discrepancy between the observed and expected frequencies for all r x c cells. Rejection of the 
null hypothesis can be the result of a discrepancy between the observed and expected frequencies 
for one cell, two cells, ..., or all r x c cells.) 


Although it is possible to employ a directional alternative hypothesis for the chi-square test 
for r x c tables, in the examples used to describe the test it will be assumed that the alternative 
hypothesis will always be stated nondirectionally. A discussion of the use of a directional alter- 
native hypothesis can be found in Section VI. 

The null and alternative hypotheses for each of the two tests that are described under the 
chi-square test for r x c tables can also be expressed within the framework of a different format. 
The alternative format for stating the null and alternative hypotheses employs the proportion of 
observations in the cells of the r x c contingency table. Before presenting the hypotheses in the
latter format, the reader should take note of the following with respect to Tables 16.2 and 16.3.
In both Tables 16.2 and 16.3 four cells can be identified: a) Cell$_{11}$ is the upper left cell in each
table (i.e., in Row 1 and Column 1 the cell with the observed frequency $O_{11} = 30$). In the case
of Example 16.1, $O_{11} = 30$ represents the number of subjects exposed to noise who helped the
confederate. In the case of Example 16.2, $O_{11} = 30$ represents the number of introverts who
are Democrats; b) Cell$_{12}$ is the upper right cell in each table (i.e., in Row 1 and Column 2 the cell
with the observed frequency $O_{12} = 70$). In the case of Example 16.1, $O_{12} = 70$ represents the
number of subjects exposed to noise who did not help the confederate. In the case of Example
16.2, $O_{12} = 70$ represents the number of introverts who are Republicans; c) Cell$_{21}$ is the lower
left cell in each table (i.e., in Row 2 and Column 1 the cell with the observed frequency
$O_{21} = 60$). In the case of Example 16.1, $O_{21} = 60$ represents the number of subjects exposed
to no noise who helped the confederate. In the case of Example 16.2, $O_{21} = 60$ represents
the number of extroverts who are Democrats; d) Cell$_{22}$ is the lower right cell in each table (i.e.,
in Row 2 and Column 2 the cell with the observed frequency $O_{22} = 40$). In the case of Example
16.1, $O_{22} = 40$ represents the number of subjects exposed to no noise who did not help the
confederate. In the case of Example 16.2, $O_{22} = 40$ represents the number of extroverts who
are Republicans.


Alternative way of stating the null and alternative hypotheses for the chi-square test for 
homogeneity If the independent variable (which represents the different groups) is employed 
as the row variable, the null and alternative hypotheses can be stated as follows: 

$H_0$: In the underlying populations the samples represent, all of the proportions in the same
column of the r x c table are equal.

$H_1$: In the underlying populations the samples represent, all of the proportions in the same
column of the r x c table are not equal for at least one of the columns.


Viewing the above hypotheses in relation to the sample data in Table 16.2, the null hypothesis
states that there is an equal proportion of observations in Cell$_{11}$ and Cell$_{21}$. With respect
to the sample data, the proportion of observations in Cell$_{11}$ is the proportion of subjects who are
exposed to noise who helped the confederate (which equals $O_{11}/O_{1.} = 30/100 = .3$). The
proportion of observations in Cell$_{21}$ is the proportion of subjects who are exposed to no noise
who helped the confederate (which equals $O_{21}/O_{2.} = 60/100 = .6$). The null hypothesis also
requires an equal proportion of observations in Cell$_{12}$ and Cell$_{22}$. The proportion of observations
in Cell$_{12}$ is the proportion of subjects who are exposed to noise who did not help the
confederate (which equals $O_{12}/O_{1.} = 70/100 = .7$). The proportion of observations in Cell$_{22}$ is
the proportion of subjects who are exposed to no noise who did not help the confederate
(which equals $O_{22}/O_{2.} = 40/100 = .4$).
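
The within-row proportions discussed above can be reproduced with a few lines of Python. The function name row_proportions is hypothetical; the counts are those of Table 16.2.

```python
def row_proportions(table):
    """Divide each cell by its row sum, giving the proportion within each sample (row)."""
    return [[cell / sum(row) for cell in row] for row in table]


# Rows: noise, no noise. Columns: helped, did not help (Table 16.2).
proportions = row_proportions([[30, 70], [60, 40]])
for label, row in zip(["Noise", "No noise"], proportions):
    print(label, row)
```

Under the null hypothesis of homogeneity the two values in each column would be equal in the underlying populations; in the sample they are .3 versus .6 in the first column and .7 versus .4 in the second.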


Alternative way of stating the null and alternative hypotheses for the chi-square test of 
independence The null and alternative hypotheses for the chi-square test of independence can 
be stated as follows: 


$H_0\colon \pi_{ij} = (\pi_{i.})(\pi_{.j})$ for all r x c cells.

$H_1\colon \pi_{ij} \ne (\pi_{i.})(\pi_{.j})$ for at least one cell.

Where: $\pi$ represents the value of a proportion in the population


The above notation indicates that if the null hypothesis is true, in the underlying population 
represented by the sample for each of the r x c cells, the proportion of observations in a cell will 
equal the proportion of observations in the row in which the cell appears multiplied by the 
proportion of observations in the column in which the cell appears. This will now be illustrated 
with respect to Example 16.2. In illustrating the relationship described in the null hypothesis, 
the notation p is employed to represent the relevant proportions obtained for the sample data. 
If the null hypothesis is true, in the case of Cell$_{11}$ it is required that the proportion of observations
in Cell$_{11}$ is equivalent to the product of the proportion of observations in Row 1 (which equals
$p_{1.} = O_{1.}/n = 100/200 = .5$) and the proportion of observations in Column 1 (which equals
$p_{.1} = O_{.1}/n = 90/200 = .45$). The result of multiplying the row and column proportions is
$(p_{1.})(p_{.1}) = (.5)(.45) = .225$. Thus, if the null hypothesis is true, the proportion of
observations in Cell$_{11}$ must equal $p_{11} = .225$. Consequently, if the value .225 is multiplied by
200, which is the total number of observations in Table 16.3, the resulting value
$(p_{11})(n) = (.225)(200) = 45$ is the number of observations that is expected in Cell$_{11}$ if the null
hypothesis is true. The same procedure can be used for the remaining three cells to determine 
the number of observations that are required in each cell in order for the null hypothesis to be 
supported. In Section IV these values, which in actuality correspond to the expected frequencies 
of the cells, are computed for each of the four cells in Table 16.3 (as well as Table 16.2). 
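
The link between the product of marginal proportions and the row-sum-times-column-sum form used for expected frequencies in Section IV can be written out as a short derivation. The symbols follow the notation already introduced in the text ($p_{i.} = O_{i.}/n$ and $p_{.j} = O_{.j}/n$); the second line simply repeats the arithmetic for Cell$_{11}$.

```latex
\begin{align*}
E_{ij} &= n \, p_{i.} \, p_{.j}
        = n \left(\frac{O_{i.}}{n}\right)\left(\frac{O_{.j}}{n}\right)
        = \frac{(O_{i.})(O_{.j})}{n} \\
E_{11} &= (200)(.5)(.45) = \frac{(100)(90)}{200} = 45
\end{align*}
```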


IV. Test Computations


The computations for the chi-square test for r x c tables will be described for Example 16.1.
The procedure to be described in this section when applied to Example 16.2 yields the identical
result since: a) The computational procedure for the chi-square test of independence is
identical to that employed for the chi-square test for homogeneity; and b) The identical data are
employed for Examples 16.1 and 16.2. Table 16.4 summarizes the data and computations for
Example 16.1.


Table 16.4 Chi-Square Summary Table for Example 16.1


Cell                                              $O_{ij}$   $E_{ij}$   $(O_{ij} - E_{ij})$   $(O_{ij} - E_{ij})^2$   $(O_{ij} - E_{ij})^2 / E_{ij}$
Cell$_{11}$: Noise/Helped the confederate            30         45           -15                  225                  5.00
Cell$_{12}$: Noise/Did not help the confederate      70         55            15                  225                  4.09
Cell$_{21}$: No noise/Helped the confederate         60         45            15                  225                  5.00
Cell$_{22}$: No noise/Did not help the confederate   40         55           -15                  225                  4.09
Sums: $\Sigma O_{ij} = 200$, $\Sigma E_{ij} = 200$, $\Sigma (O_{ij} - E_{ij}) = 0$, $\chi^2 = 18.18$


The observed frequency of each cell ($O_{ij}$) is listed in Column 2 of Table 16.4. Column 3
contains the expected cell frequencies ($E_{ij}$). In order to conduct the chi-square test for r x c
tables, the observed frequency for each cell must be compared with its expected frequency. In 
order to determine the expected frequency of a cell, the data should be arranged in a contingency 
table that employs the format of Table 16.2. The following protocol is then employed to de- 
termine the expected frequency of a cell: a) Multiply the sum of the observations in the row in 
which the cell appears by the sum of the observations in the column in which the cell appears; 
b) Divide n, the total number of observations, into the product that results from multiplying the 
row and column sums for the cell. 

The computation of an expected cell frequency can be summarized by Equation 16.1. 


$$E_{ij} = \frac{(O_{i.})(O_{.j})}{n}$$     (Equation 16.1)

Applying Equation 16.1 to Cell$_{11}$ in Table 16.2 (i.e., noise/helped the confederate), the
expected cell frequency can be computed as follows. The row sum is the total number of subjects
who were exposed to noise. Thus, $O_{1.} = 100$. The column sum is the total number of subjects
who helped the confederate. Thus, $O_{.1} = 90$. Employing Equation 16.1, the expected fre-
quency for Cell$_{11}$ can now be computed: $E_{11} = [(O_{1.})(O_{.1})]/n = [(100)(90)]/200 = 45$. The
expected frequencies for the remaining three cells in the 2 x 2 contingency table that summarizes
the data for Example 16.1 are computed below:


Cell$_{12}$: $E_{12} = [(O_{1.})(O_{.2})]/n = [(100)(110)]/200 = 55$
Cell$_{21}$: $E_{21} = [(O_{2.})(O_{.1})]/n = [(100)(90)]/200 = 45$
Cell$_{22}$: $E_{22} = [(O_{2.})(O_{.2})]/n = [(100)(110)]/200 = 55$
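
The protocol for expected frequencies (row sum times column sum, divided by n) extends directly to any r x c table. The Python sketch below is one way to write it; the function name expected_frequencies is hypothetical, and the input is assumed to be a list of rows of observed counts.

```python
def expected_frequencies(observed):
    """Apply E_ij = (O_i.)(O_.j) / n to every cell of an r x c table."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    return [[r * c / n for c in col_sums] for r in row_sums]


# Observed frequencies from Table 16.2 (identical to Table 16.3).
expected = expected_frequencies([[30, 70], [60, 40]])
print(expected)  # [[45.0, 55.0], [45.0, 55.0]]
```
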


Upon determining the expected cell frequencies, the test statistic for the chi-square test for
r x c tables is computed with Equation 16.2.


$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \left[ \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \right]$$     (Equation 16.2)

The operations described by Equation 16.2 (which are the same as those described for 
computing the chi-square statistic for the chi-square goodness-of-fit test) are as follows: a) The 
expected frequency of each cell is subtracted from its observed frequency (summarized in 
Column 4 of Table 16.4); b) For each cell, the difference between the observed and expected fre- 
quency is squared (summarized in Column 5 of Table 16.4); c) For each cell, the squared 
difference between the observed and expected frequency is divided by the expected frequency 
of the cell (summarized in Column 6 of Table 16.4); and d) The value of chi-square is computed 
by summing all of the values in Column 6. For Example 16.1, Equation 16.2 yields the value
$\chi^2 = 18.18$.

Note that in Table 16.4 the sums of the observed and expected frequencies are identical. 
This must always be the case, and any time these sums differ from one another, it indicates that 
a computational error has been made. It is also required that the sum of the differences between 
the observed and expected frequencies equals zero (i.e., $\Sigma(O_{ij} - E_{ij}) = 0$). Any time the latter
value does not equal zero, it indicates an error has been made. Since all of the $(O_{ij} - E_{ij})$ values
are squared in Column 5, the sum of Column 6, which represents the value of $\chi^2$, must always
be a positive number. If a negative value for chi-square is obtained, it indicates that an error has
been made. The only time $\chi^2$ will equal zero is when $O_{ij} = E_{ij}$ for all r x c cells.
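
The four operations of Equation 16.2, together with the arithmetic checks just described, can be sketched in Python. The function name chi_square_statistic is hypothetical; the built-in check mirrors the requirement that the differences between observed and expected frequencies sum to zero.

```python
def chi_square_statistic(observed):
    """Compute chi-square for an r x c table of observed counts (Equation 16.2)."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    chi_square = 0.0
    total_difference = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_sums[i] * col_sums[j] / n      # expected frequency (Equation 16.1)
            total_difference += o_ij - e_ij           # Column 4 of Table 16.4
            chi_square += (o_ij - e_ij) ** 2 / e_ij   # Column 6 of Table 16.4
    # The (O - E) differences must sum to zero, apart from floating-point rounding.
    assert abs(total_difference) < 1e-9
    return chi_square


chi_square = chi_square_statistic([[30, 70], [60, 40]])
print(round(chi_square, 2))  # 18.18
```
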


V. Interpretation of the Test Results 


The obtained value $\chi^2 = 18.18$ is evaluated with Table A4 (Table of the Chi-Square Dis-
tribution) in the Appendix. A general discussion of the values in Table A4 can be found in 
Section V of the single-sample chi-square test for a population variance (Test 3). When the 
chi-square distribution is employed to evaluate the chi-square test for r x c tables, the degrees 
of freedom employed for the analysis are computed with Equation 16.3. 


$$df = (r - 1)(c - 1)$$     (Equation 16.3)


The tabled critical values in Table A4 for the chi-square test for r x c tables are always 
derived from the right tail of the distribution. The critical chi-square value for a specific value 
of alpha is the tabled value at the percentile that corresponds to the value $(1 - \alpha)$. Thus, the
tabled critical .05 chi-square value (to be designated $\chi^2_{.05}$) is the tabled value at the 95th
percentile. In the same respect, the tabled critical .01 chi-square value (to be designated $\chi^2_{.01}$)
is the tabled value at the 99th percentile. In order to reject the null hypothesis, the obtained value 
of chi-square must be equal to or greater than the tabled critical value at the prespecified level 
of significance. The aforementioned guidelines for determining tabled critical chi-square values 
are employed when the alternative hypothesis is stated nondirectionally (which, as noted earlier, 
is generally the case for the chi-square test for r x c tables). The determination of tabled critical 
chi-square values in reference to a directional alternative hypothesis is discussed in Section VI. 

The guidelines for a nondirectional analysis will now be applied to Example 16.1. Since 
r = 2 and c = 2, the degrees of freedom are computed to be $df = (2 - 1)(2 - 1) = 1$. The tabled
critical .05 chi-square value for df = 1 is $\chi^2_{.05} = 3.84$, which as noted above is the tabled chi-
square value at the 95th percentile. The tabled critical .01 chi-square value for df = 1 is
$\chi^2_{.01} = 6.63$, which as noted above is the tabled chi-square value at the 99th percentile. Since
the computed value $\chi^2 = 18.18$ is greater than both of the aforementioned critical values, the
null hypothesis can be rejected at both the .05 and .01 levels. Rejection of the null hypothesis
at the .01 level can be summarized as follows: $\chi^2(1) = 18.18$, $p < .01$.
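
The degrees of freedom and the tabled percentiles quoted above can be cross-checked without a table in the special case df = 1, because a chi-square variable with one degree of freedom is the square of a standard normal variable. The Python sketch below relies on that fact and is therefore valid only for df = 1 (i.e., 2 x 2 tables); the function names are hypothetical.

```python
import math


def degrees_of_freedom(r, c):
    """Equation 16.3: df = (r - 1)(c - 1)."""
    return (r - 1) * (c - 1)


def chi_square_cdf_df1(x):
    """P(chi-square <= x) for df = 1 only, via the standard normal relationship."""
    return math.erf(math.sqrt(x / 2))


def chi_square_upper_tail_df1(x):
    """P(chi-square >= x) for df = 1 only (right tail of the distribution)."""
    return math.erfc(math.sqrt(x / 2))


df = degrees_of_freedom(2, 2)
print(df)                                  # 1
print(chi_square_cdf_df1(3.84))            # close to .95
print(chi_square_cdf_df1(6.63))            # close to .99
p_value = chi_square_upper_tail_df1(18.18)
print(p_value < .01)                       # True
```

For tables with more than one degree of freedom this shortcut does not apply, and the tabled values (or a general chi-square routine from a statistics library) would be needed instead.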

The significant chi-square value obtained for Example 16.1 indicates that subjects who 
served in the noise condition helped the confederate significantly less than subjects who served 
in the no noise condition. This can be confirmed by visual inspection of Table 16.2, which
reveals that twice as many subjects who served in the no noise condition helped the confederate
as subjects who served in the noise condition.

As noted previously, the chi-square analysis described in this section also applies to Ex- 
ample 16.2, since the latter example employs the same data as Example 16.1. Thus, with respect 
to Example 16.2, the significant $\chi^2 = 18.18$ value allows the researcher to conclude that a
subject's categorization on the introvert-extrovert dimension is associated with (i.e., not
independent of) one's political affiliation. This can be confirmed by visual inspection of Table 16.3,
which reveals that introverts are more likely to be Republicans whereas extroverts are more 
likely to be Democrats. 

It is important to note that Example 16.2 represents a correlational study, and as such does 
not allow a researcher to draw any conclusions with regard to cause and effect. To be more 
specific, the study does not allow one to conclude that a subject's categorization on the 
personality dimension introvert-extrovert is the cause of one's political affiliation (Democrat 
versus Republican), or vice versa (i.e., that political affiliation causes one to be an introvert 
versus an extrovert). Although it is possible that the two variables employed in a correlational 
study are causally related to one another, such studies do not allow one to draw conclusions 
regarding cause and effect, since they fail to control for the potential influence of confounding 
variables. Because of this, when studies which are evaluated with the chi-square test of 
independence (such as Example 16.2) yield a significant result, one can only conclude that 
in the underlying population the two variables have a correlation with one another that is 
some value other than zero (which is not commensurate with saying that one variable causes the 
other). 

Studies such as that represented by Example 16.2 can also be conceptualized within the 
framework of a natural experiment (also referred to as an ex post facto study) which is dis- 
cussed in the Introduction of the book. In the latter type of study, one of the two variables is 
designated as the independent variable, and the second variable as the dependent variable. The 
independent variable is (in contrast to the independent variable in a true experiment) a non- 
manipulated variable. A subject's score (or category in the case of Example 16.2) on a non- 
manipulated independent variable is based on some preexisting subject characteristic, rather than 
being a direct result of some manipulation on the part of the experimenter. Thus, if in Example 
16.2 the introvert-extrovert dimension is designated as the independent variable, it represents 
a nonmanipulated variable, since the experimenter does not determine whether or not a subject 
becomes an introvert or an extrovert. Which of the two aforementioned categories a subject 
falls into is determined beforehand by "nature" (thus the term natural experiment). The same 
logic also applies if political affiliation is employed as the independent variable, since, like 
introvert-extrovert, the Democrat-Republican dichotomization is a preexisting subject char- 
acteristic. 

In Example 16.1, however, the independent variable, which is whether or not a subject is 
exposed to noise, is a manipulated variable. This is the case, since the experimenter randomly 
determines those subjects who are assigned to the noise condition and those who are assigned to 
the no noise condition. As noted in the Introduction, an experiment in which the researcher 
manipulates the level of the independent variable to which a subject is assigned is referred to as 
a true experiment. In the latter type of experiment, by virtue of randomly assigning subjects to 
the different experimental conditions, the researcher is able to control for the effects of potentially 
confounding variables. Because of this, if a significant result is obtained in a true experiment, 
a researcher is justified in drawing conclusions with regard to cause and effect.

VI. Additional Analytical Procedures for the Chi-Square Test
for r x c Tables and/or Related Tests


1. Yates' correction for continuity In Section I it is noted that, in actuality, the chi-square 
test for r x c tables employs a continuous distribution to approximate a discrete probability 
distribution. Under such conditions, some sources recommend that a correction for continuity 
be employed. As noted previously in the book, the correction for continuity is based on the 
premise that if a continuous distribution is employed to estimate a discrete distribution, such an 
approximation will inflate the Type I error rate. By employing the correction for continuity, the 
Type I error rate is ostensibly adjusted to be more compatible with the prespecified alpha level 
designated by the researcher. Sources that recommend the correction for continuity for the chi- 
square test for r x c tables only recommend that it be employed in the case of 2 x 2 contingency 
tables. Equation 16.4 (which was developed by Yates (1934)) is the continuity-corrected chi- 
square equation for 2 x 2 tables. 





$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(|O_{ij} - E_{ij}| - .5)^2}{E_{ij}} \qquad \text{(Equation 16.4)}$$


Note that by subtracting .5 from the absolute value of the difference between each set of
observed and expected frequencies, the chi-square value derived with Equation 16.4 will be
lower than the value computed with Equation 16.2.

Statisticians are not in agreement with respect to whether it is prudent to employ the
correction for continuity described by Equation 16.4 with a 2 x 2 contingency table. To be more
specific, various sources take the following positions with respect to what the most effective
strategy is for evaluating 2 x 2 tables:

a) Most sources agree that when the sample size for a 2 x 2 table is small (generally less
than 20), the Fisher exact test (which is described later in this section) should be employed
instead of the chi-square test for r x c tables. Cochran (1952, 1954) stated that in the case of
2 x 2 tables, the chi-square test for r x c tables should not be employed when n < 20, and that
when 20 < n < 40 the test should only be employed if all of the expected frequencies are at least
equal to 5. Additionally, when n > 40 all expected frequencies should be equal to or greater than 1.

b) Some sources recommend that for small sample sizes Yates’ correction for continuity be
employed. This recommendation assumes that the size of the sample is at least equal to 20 (since,
when n < 20, the Fisher exact test should be employed), but less than some value that defines the
maximum size of a small sample with respect to the use of Yates' correction. Sources do not agree
on what value of n defines the upper limit beyond which Yates’ correction is not required.

c) Some sources recommend that Yates’ correction for continuity should always be employed
with 2 x 2 tables, regardless of the sample size.

d) To further confuse the issue, many sources take the position that Yates’ correction for
continuity should never be used, since the chi-square value computed with Equation 16.4 results
in an overcorrection (i.e., it results in an overly conservative test).

e) Haber (1980, 1982) argues that alternative continuity correction procedures (including one
developed by Haber) are superior to Yates' correction for continuity. Haber's (1980, 1982)
continuity correction procedure is described in Zar (1999, pp. 494-495).

Table 16.5 illustrates the application of Yates’ correction for continuity with Example
16.1. By employing Equation 16.4 the obtained value of chi-square is reduced to 16.98 (in
contrast to the value $\chi^2 = 18.18$ obtained with Equation 16.2). Since the obtained value
$\chi^2 = 16.98$ is greater than both $\chi^2_{.05} = 3.84$ and $\chi^2_{.01} = 6.63$, the null hypothesis can still be
rejected at both the .05 and .01 levels. Thus, in this instance Yates’ correction for continuity
leads to the same conclusions as those reached when Equation 16.2 is employed.

Table 16.5 Chi-Square Summary Table for Example 16.1 Employing
Yates’ Correction for Continuity

Cell                                             O_ij   E_ij   (|O_ij - E_ij| - .5)   (|O_ij - E_ij| - .5)^2   (|O_ij - E_ij| - .5)^2 / E_ij
Cell 11: Noise/Helped the confederate             30     45           14.5                   210.25                       4.67
Cell 12: Noise/Did not help the confederate       70     55           14.5                   210.25                       3.82
Cell 21: No noise/Helped the confederate          60     45           14.5                   210.25                       4.67
Cell 22: No noise/Did not help the confederate    40     55           14.5                   210.25                       3.82
                                          ΣO_ij = 200   ΣE_ij = 200                                                  χ² = 16.98
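
The computations summarized in Table 16.5 can be sketched in a few lines of Python. The sketch is illustrative only: the function name `chi_square_r_by_c` and its `yates` flag are assumptions introduced here rather than notation from the text. Each expected frequency is derived from the marginal sums (Equation 16.1), and when the flag is set, .5 is subtracted from each absolute difference as in Equation 16.4.

```python
def chi_square_r_by_c(table, yates=False):
    """Chi-square statistic for a table of observed frequencies (a list of rows)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n   # Equation 16.1
            difference = abs(observed - expected)
            if yates:
                difference -= 0.5                      # continuity correction (Equation 16.4)
            total += difference ** 2 / expected
    return total


example_16_1 = [[30, 70],   # noise: helped, did not help
                [60, 40]]   # no noise: helped, did not help

uncorrected = chi_square_r_by_c(example_16_1)
corrected = chi_square_r_by_c(example_16_1, yates=True)
print(round(uncorrected, 2))  # 18.18
print(corrected)
```

Carried to more decimal places, the corrected value is approximately 16.99. The value 16.98 in Table 16.5 reflects the summing of cell entries that were first rounded to two decimal places.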


2. Quick computational equation for a 2 x 2 table Equation 16.5 is a quick computational 
equation that can be employed for the chi-square test for r x c tables in the case of a 2 x 2 table. 
Unlike Equation 16.2, it does not require that the expected cell frequencies be computed. The 
notation employed in Equation 16.5 is based on the model for a 2 x 2 contingency table 
summarized in Table 16.6. 


Table 16.6 Model for 2 x 2 Contingency Table 


               Column 1    Column 2    Row sums
Row 1             a           b        a + b = n_1
Row 2             c           d        c + d = n_2
Column sums     a + c       b + d      n

$$\chi^2 = \frac{n(ad - bc)^2}{(a + b)(c + d)(a + c)(b + d)} \qquad \text{(Equation 16.5)}$$

Where: a, b, c, and d represent the number of observations in the relevant cell


Using the model depicted in Table 16.6, by employing the appropriate observed cell
frequencies for Examples 16.1 and 16.2, we know that a = 30, b = 70, c = 60, and d = 40.
Substituting these values in Equation 16.5, the value $\chi^2 = 18.18$ is computed (which is the same
chi-square value that is computed with Equation 16.2).

$$\chi^2 = \frac{200[(30)(40) - (70)(60)]^2}{(30 + 70)(60 + 40)(30 + 60)(70 + 40)} = 18.18$$


If Yates’ correction for continuity is applied to a 2 x 2 table, Equation 16.6 is the
continuity-corrected version of Equation 16.5.

$$\chi^2 = \frac{n(|ad - bc| - .5n)^2}{(a + b)(c + d)(a + c)(b + d)} \qquad \text{(Equation 16.6)}$$

Substituting the data for Examples 16.1 and 16.2 in Equation 16.6, the value $\chi^2 = 16.98$
is computed (which is the same continuity-corrected chi-square value computed with Equation
16.4).


$$\chi^2 = \frac{200[|(30)(40) - (70)(60)| - (.5)(200)]^2}{(30 + 70)(60 + 40)(30 + 60)(70 + 40)} = 16.98$$
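
The two quick computational equations can be expressed as a single function. What follows is a minimal sketch, assuming the cell labels of Table 16.6. The function name `chi_square_quick` and the `yates` argument are hypothetical names chosen for illustration.

```python
def chi_square_quick(a, b, c, d, yates=False):
    """Chi-square for a 2 x 2 table without computing expected frequencies."""
    n = a + b + c + d
    cross_product = abs(a * d - b * c)
    if yates:
        cross_product -= 0.5 * n          # Equation 16.6
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * cross_product ** 2 / denominator   # Equation 16.5 when yates is False


quick_uncorrected = chi_square_quick(30, 70, 60, 40)
quick_corrected = chi_square_quick(30, 70, 60, 40, yates=True)
print(round(quick_uncorrected, 2))  # 18.18
print(quick_corrected)
```

Squaring the cross-product difference makes the absolute value optional in Equation 16.5, but it is required in Equation 16.6, since the correction must always shrink the difference toward zero.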

3. Evaluation of a directional alternative hypothesis in the case of a 2 x 2 contingency table
In the case of 2 x 2 contingency tables it is possible to employ a directional/one-tailed alternative
hypothesis. Prior to reading this section, the reader may find it useful to review the relevant
material on this subject in Section VII of the chi-square goodness-of-fit test.

In the case of a 2 x 2 contingency table, it is possible to make two directional predictions.
In stating the null and alternative hypotheses, the following notation (in reference to the sample
data) based on the model for a 2 x 2 contingency table described in Table 16.6 will be employed.

$$p_1 = \frac{a}{a + b} = \frac{a}{n_1} \qquad\qquad p_2 = \frac{c}{c + d} = \frac{c}{n_2}$$


The value $p_1$ represents the proportion of observations in Row 1 that falls in Cell a, while
the value $p_2$ represents the proportion of observations in Row 2 that falls in Cell c. The
analogous proportions in the underlying populations that correspond to $p_1$ and $p_2$ will be represented
by the notation $\pi_1$ and $\pi_2$. Thus, $\pi_1$ represents the proportion of observations in Row 1 in the
underlying population that falls in Cell a, while $\pi_2$ represents the proportion of
observations in Row 2 in the underlying population that falls in Cell c. Employing the
aforementioned notation, it is possible to make either of the two following directional predictions for
a 2 x 2 contingency table.

a) In the underlying population(s) the sample(s) represent, the proportion of observations
in Row 1 that falls in Cell a is greater than the proportion of observations in Row 2 that falls in
Cell c. The null hypothesis and directional alternative hypothesis for this prediction are stated
as follows: $H_0: \pi_1 = \pi_2$ versus $H_1: \pi_1 > \pi_2$. With respect to Example 16.1, the latter
alternative hypothesis predicts that a larger proportion of subjects in the noise condition will help the
confederate than subjects in the no noise condition. In Example 16.2, the alternative hypothesis
predicts that a larger proportion of introverts than extroverts will be Democrats.

b) In the underlying population(s) the sample(s) represent, the proportion of observations in
Row 1 that falls in Cell a is less than the proportion of observations in Row 2 that falls in Cell c.
The null hypothesis and directional alternative hypothesis for this prediction are stated as follows:
$H_0: \pi_1 = \pi_2$ versus $H_1: \pi_1 < \pi_2$. With respect to Example 16.1, the latter alternative
hypothesis predicts that a larger proportion of subjects in the no noise condition will help the
confederate than subjects in the noise condition. In Example 16.2, the alternative hypothesis
predicts that a larger proportion of extroverts than introverts will be Democrats.

As is the case for the chi-square goodness-of-fit test, if a researcher wants to evaluate a
one-tailed alternative hypothesis at the .05 level, the appropriate critical value to employ is $\chi^2_{.10}$,
which is the tabled chi-square value at the .10 level of significance. The latter value is
represented by the tabled chi-square value at the 90th percentile (which demarcates the extreme 10%
in the right tail of the chi-square distribution). If a researcher wants to evaluate a one-tailed/
directional alternative hypothesis at the .01 level, the appropriate critical value to employ is
$\chi^2_{.02}$, which is the tabled chi-square value at the .02 level of significance. The latter value is
represented by the tabled chi-square value at the 98th percentile (which demarcates the extreme
2% in the right tail of the chi-square distribution).
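
For df = 1 the critical values described above can be obtained without a table, because a chi-square variable with one degree of freedom is the square of a standard normal deviate. The sketch below relies on that relationship and on Python's standard library. The helper name `chi_square_critical_df1` is an assumption made for illustration, and the approach does not extend to df greater than 1.

```python
from statistics import NormalDist


def chi_square_critical_df1(right_tail_area):
    """Chi-square value (df = 1) that cuts off the given area in the right tail."""
    # The right-tail area of chi-square corresponds to both tails of the
    # standard normal distribution, so half of the area lies in each normal tail.
    z = NormalDist().inv_cdf(1 - right_tail_area / 2)
    return z ** 2


two_tailed_05 = chi_square_critical_df1(0.05)   # nondirectional test, alpha = .05
two_tailed_01 = chi_square_critical_df1(0.01)   # nondirectional test, alpha = .01
one_tailed_05 = chi_square_critical_df1(0.10)   # directional test, alpha = .05
one_tailed_01 = chi_square_critical_df1(0.02)   # directional test, alpha = .01

for value in (two_tailed_05, two_tailed_01, one_tailed_05, one_tailed_01):
    print(round(value, 2))
```

The computed 98th percentile is approximately 5.41, slightly below the tabled value of 5.43 cited in the text. The difference does not alter any of the conclusions reached for the examples in this section.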

If a one-tailed alternative hypothesis is evaluated for Examples 16.1 and 16.2, from Table
A4 it can be determined that for df = 1 the relevant tabled critical one-tailed .05 and .01 values
are $\chi^2_{.10} = 2.71$ and $\chi^2_{.02} = 5.43$.⁹ Note that when one employs a one-tailed alternative
hypothesis it is easier to reject the null hypothesis, since the one-tailed .05 and .01 critical values are less
than the two-tailed .05 and .01 values (which for df = 1 are $\chi^2_{.05} = 3.84$ and $\chi^2_{.01} = 6.63$).¹⁰ In
conducting a one-tailed analysis, it is important to note, however, that if the obtained value of
chi-square is equal to or greater than the tabled critical value at the prespecified level of significance,
only one of the two possible alternative hypotheses can be supported. The alternative hypothesis
that is supported is the one that is consistent with the data.

Since for Examples 16.1 and 16.2, the computed value $\chi^2 = 18.18$ is greater than both of
the one-tailed critical values $\chi^2_{.10} = 2.71$ and $\chi^2_{.02} = 5.43$, the null hypothesis can be rejected
at both the .05 and .01 levels, but only if the directional alternative hypothesis $H_1: \pi_1 < \pi_2$ is
employed. If the directional hypothesis $H_1: \pi_1 > \pi_2$ is employed, the null hypothesis cannot
be rejected, since the data are not consistent with the latter alternative hypothesis.

When evaluating contingency tables in which the number of rows and/or columns is greater
than two, it is possible to run a multi-tailed test (which, as noted in the discussion of the
chi-square goodness-of-fit test, is the term that is sometimes used when there are more than two
possible directional alternative hypotheses). It would be quite unusual to encounter the use of
a multi-tailed analysis for an r x c table, which requires that a researcher determine all possible 
directional patterns/ordinal configurations for a set of data, and then predict one or more of the 
specific patterns that will occur. In the event one elects to conduct a multi-tailed analysis, the 
determination of the appropriate critical values is based on the same guidelines that are discussed 
for multi-tailed tests under the chi-square goodness-of-fit test. 


4. Test 16c: The Fisher exact test In Section I it is noted that the chi-square distribution 
provides an approximation of the exact sampling distribution for a contingency table. In the case 
of 2 x 2 tables, the chi-square distribution is employed to approximate the hypergeometric 
distribution which will be discussed in this section. (The hypergeometric distribution is discussed 
in detail in Section IX (the Addendum) of the binomial sign test for a single sample (Test 9).) 
As noted earlier, when n < 20 most sources recommend that the Fisher exact test (which
employs exact hypergeometric probabilities) be employed to evaluate a 2 x 2 contingency table. 
Table 16.6, which is used earlier in this section to summarize a 2 x 2 table, will be employed to 
describe the model for the hypergeometric distribution upon which the Fisher exact test is based. 

According to Daniel (1990) the Fisher exact test, which is also referred to as the Fisher- 
Irwin test, was simultaneously described by Fisher (1934, 1935), Irwin (1935), and Yates 
(1934). The test shares the same assumptions as those noted for the chi-square test for r x c 
tables, with the exception of the assumption regarding small expected frequencies (which reflects 
the limitations of the latter test with small sample sizes). Many sources note that an additional 
assumption of the Fisher exact test is that both the row and column sums of a 2 x 2 contingency
table are predetermined by the researcher. In truth, this latter assumption is rarely met and,
consequently, the test is used with 2 x 2 contingency tables involving small sample sizes when
one or neither of the marginal sums is predetermined by the researcher. The Fisher exact test 
is more commonly employed with the model described for the chi-square test of homogeneity 
than it is with the model described for the chi-square test of independence. 

Equation 16.7, which is the equation for a hypergeometrically distributed variable, allows 
for the computation of the exact probability (P) of obtaining a specific set of observed frequencies 
in a 2 x 2 contingency table. Equation 16.7, which uses the notation for the Fisher exact test 
model, is equivalent to Equation 9.15, which is more commonly employed to represent the general 
equation for a hypergeometrically distributed variable. Since Equation 16.7 involves the
computation of combinations, the reader may find it useful to review the discussion of combinations in
Section IV of the binomial sign test for a single sample.


$$P = \frac{\binom{a + c}{a}\binom{b + d}{b}}{\binom{n}{a + b}} \qquad \text{(Equation 16.7)}$$


Equation 16.8 is a computationally more efficient form of Equation 16.7 which yields the 
same probability value. 


$$P = \frac{(a + c)!\,(b + d)!\,(a + b)!\,(c + d)!}{n!\,a!\,b!\,c!\,d!} \qquad \text{(Equation 16.8)}$$


Example 16.3 (which is a small sample size version of Example 16.1) will be employed to 
illustrate the Fisher exact test. 


Example 16.3 A researcher conducts a study in order to evaluate the effect of noise on 
altruistic behavior. Each of the 12 subjects who participate in the experiment is randomly 
assigned to one of two experimental conditions. Subjects in both conditions are given a one-hour 
test which is ostensibly a measure of intelligence. During the test the six subjects in Group 1 are 
exposed to continual loud noise, which they are told is due to a malfunctioning generator. The 
six subjects in Group 2 are not exposed to any noise during the test. Upon completion of this 
stage of the experiment, each subject on leaving the room is confronted by a middle-aged man 
whose arm is in a sling. The man asks the subject if she would be willing to help him carry a 
heavy package to his car. In actuality, the man requesting help is an experimental confederate 
(i.e., working for the experimenter). The number of subjects in each group who help the man is 
recorded. One of the six subjects who were exposed to noise elects to help the man, while five 
of the six subjects who were not exposed to noise elect to help the man. Do the data indicate that 
altruistic behavior is influenced by noise? 


The data for Example 16.3, which can be summarized in the form of a 2 x 2 contingency 
table, are presented in Table 16.7. 


Table 16.7 Summary of Data for Example 16.3 


               Helped the     Did not help       Row sums
               confederate    the confederate
Noise            a = 1           b = 5           a + b = n_1 = 6
No noise         c = 5           d = 1           c + d = n_2 = 6
Column sums    a + c = 6       b + d = 6         n = 12


The null and alternative hypotheses for the Fisher exact test are most commonly stated
using the format described in the discussion of the evaluation of a directional alternative
hypothesis for a 2 x 2 contingency table. Thus, the null hypothesis and nondirectional alternative
hypothesis are as follows:

$H_0: \pi_1 = \pi_2$

(In the underlying populations the samples represent, the proportion of observations in Row 1
(the noise condition) that falls in Cell a is equal to the proportion of observations in Row 2
(the no noise condition) that falls in Cell c.)

$H_1: \pi_1 \neq \pi_2$


(In the underlying populations the samples represent, the proportion of observations in Row 1 
(the noise condition) that falls in Cell a is not equal to the proportion of observations in Row 2 
(the no noise condition) that falls in Cell c.) 


The alternative hypothesis can also be stated directionally, as described in the discussion
of the evaluation of a directional alternative hypothesis for a 2 x 2 contingency table (i.e.,
$H_1: \pi_1 > \pi_2$ or $H_1: \pi_1 < \pi_2$).¹¹

Employing Equations 16.7 and 16.8, the probability of obtaining the specific set of 
observed frequencies in Table 16.7 is computed to be P = .039. 


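Equation 16.7:

$$P = \frac{\binom{6}{1}\binom{6}{5}}{\binom{12}{6}} = \frac{\left(\frac{6!}{1!\,5!}\right)\left(\frac{6!}{5!\,1!}\right)}{\frac{12!}{6!\,6!}} = .039$$

Equation 16.8:

$$P = \frac{6!\,6!\,6!\,6!}{12!\,1!\,5!\,5!\,1!} = .039$$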
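
The equivalence of Equations 16.7 and 16.8 can be checked numerically. The following is a sketch using Python's standard `math` module. The two function names are hypothetical and simply mirror the equation numbers, and the cell counts are those of Table 16.7.

```python
from math import comb, factorial


def probability_eq_16_7(a, b, c, d):
    """Hypergeometric probability of one 2 x 2 table, written with combinations."""
    n = a + b + c + d
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)


def probability_eq_16_8(a, b, c, d):
    """The same probability, written with factorials of the margins and cells."""
    n = a + b + c + d
    numerator = factorial(a + c) * factorial(b + d) * factorial(a + b) * factorial(c + d)
    denominator = factorial(n) * factorial(a) * factorial(b) * factorial(c) * factorial(d)
    return numerator / denominator


p_combinations = probability_eq_16_7(1, 5, 5, 1)
p_factorials = probability_eq_16_8(1, 5, 5, 1)
print(round(p_combinations, 3))  # 0.039
print(round(p_factorials, 3))    # 0.039
```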


In order to evaluate the null hypothesis, in addition to the probability P = .039 (which is the 
probability of obtaining the set of observed frequencies in Table 16.7) it is also necessary to 
compute the probabilities for any sets of observed frequencies that are even more extreme than 
the observed frequencies in Table 16.7. The only result that is more extreme than the result 
summarized in Table 16.7 is if all six subjects in the no noise condition helped the confederate, 
while all six subjects in the noise condition did not help the confederate. Table 16.8 sum- 
marizes the observed frequencies for the latter result. 


Table 16.8 Most Extreme Possible Set of Observed Frequencies 
for Example 16.3 


               Helped the     Did not help       Row sums
               confederate    the confederate
Noise            a = 0           b = 6           a + b = n_1 = 6
No noise         c = 6           d = 0           c + d = n_2 = 6
Column sums    a + c = 6       b + d = 6         n = 12


Employing Equations 16.7 and 16.8, the probability of obtaining the set of observed fre- 
quencies in Table 16.8 is computed to be P = .001. 


Equation 16.7:

$$P = \frac{\binom{6}{0}\binom{6}{6}}{\binom{12}{6}} = \frac{\left(\frac{6!}{0!\,6!}\right)\left(\frac{6!}{6!\,0!}\right)}{\frac{12!}{6!\,6!}} = .001$$

Equation 16.8:

$$P = \frac{6!\,6!\,6!\,6!}{12!\,0!\,6!\,6!\,0!} = .001$$


When P = .001 (the probability of obtaining the set of observed frequencies in Table 16.8)
is added to P = .039, the resulting probability represents the likelihood of obtaining a set of
observed frequencies that is equal to or more extreme than the set of observed frequencies in
Table 16.7. The notation $P_T$ will be used to represent the latter value. Thus, in our example,
$P_T = .039 + .001 = .04$.¹²

The following guidelines are employed for Example 16.3 in evaluating the null hypothesis 
for the Fisher exact test. 

a) If the nondirectional alternative hypothesis $H_1: \pi_1 \neq \pi_2$ is employed, the value of
$P_T$ (i.e., the probability of obtaining a set of observed frequencies equal to or more extreme than
the set obtained in the study) must be equal to or less than $\alpha/2$. Thus, if the prespecified value
of alpha is $\alpha = .05$, the obtained value of $P_T$ must be equal to or less than .05/2 = .025. If the
prespecified value of alpha is $\alpha = .01$, the obtained value of $P_T$ must be equal to or less than
.01/2 = .005.

b) If a directional alternative hypothesis is employed, the observed set of frequencies for
the study must be consistent with the directional alternative hypothesis, and the value of $P_T$ must
be equal to or less than the prespecified value of alpha. Thus, if the prespecified value of alpha
is $\alpha = .05$, the obtained value of $P_T$ must be equal to or less than .05. If the prespecified value
of alpha is $\alpha = .01$, the obtained value of $P_T$ must be equal to or less than .01.

Employing the above guidelines, the following conclusions can be reached. 

If $\alpha = .05$, the nondirectional alternative hypothesis $H_1: \pi_1 \neq \pi_2$ is not supported, since
the obtained value $P_T = .04$ is greater than .05/2 = .025.

The directional alternative hypothesis $H_1: \pi_1 < \pi_2$ is supported, but only at the .05 level.
This is the case, since the data are consistent with the latter alternative hypothesis, and the
obtained value $P_T = .04$ is less than $\alpha = .05$. The latter alternative hypothesis is not supported at
the .01 level, since $P_T = .04$ is greater than $\alpha = .01$.

The directional alternative hypothesis $H_1: \pi_1 > \pi_2$ is not supported, since it is not
consistent with the data. In order for the data to be consistent with the alternative hypothesis
$H_1: \pi_1 > \pi_2$, it is required that a larger proportion of subjects in the noise condition helped
the confederate than subjects in the no noise condition.
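
The guidelines and conclusions above can be sketched as a short program. The function names `cumulative_probability` and `is_supported` are assumptions introduced for illustration. The sketch also assumes, as in Example 16.3, that the observed value in Cell a is smaller than expected, so that a more extreme table is obtained by moving observations out of Cells a and d while the marginal sums remain fixed.

```python
from math import comb


def point_probability(a, b, c, d):
    """Equation 16.7: probability of one specific 2 x 2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)


def cumulative_probability(a, b, c, d):
    """P_T: the observed table plus every more extreme table in the same tail."""
    total = 0.0
    while a >= 0 and d >= 0:
        total += point_probability(a, b, c, d)
        # Shift one observation out of cells a and d; margins are unchanged.
        a, b, c, d = a - 1, b + 1, c + 1, d - 1
    return total


def is_supported(p_t, alpha, directional, consistent_with_data=True):
    """Guideline a) for a nondirectional and guideline b) for a directional hypothesis."""
    if directional:
        return consistent_with_data and p_t <= alpha
    return p_t <= alpha / 2


p_t = cumulative_probability(1, 5, 5, 1)
print(round(p_t, 2))                                # 0.04
print(is_supported(p_t, 0.05, directional=False))   # False
print(is_supported(p_t, 0.05, directional=True))    # True
print(is_supported(p_t, 0.01, directional=True))    # False
print(is_supported(p_t, 0.05, directional=True, consistent_with_data=False))  # False
```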

To further clarify how to interpret a directional versus a nondirectional alternative hypoth- 
esis, consider Table 16.9 which presents all seven possible outcomes of observed cell frequencies 
for n = 12 in which the marginal sums (i.e., the row and column sums) equal six (which are the 
values for the marginal sums in Example 16.3). 

The sum of the probabilities for the seven outcomes presented in Table 16.9 equals 1. This
is the case, since the seven outcomes represent all the possible outcomes for the cell frequencies
if the marginal sum of each row and column equals six. As noted earlier, if a researcher
evaluates the directional alternative hypothesis $H_1: \pi_1 < \pi_2$ for Example 16.3, he will only be
interested in Outcomes 1 and 2. The combined probability for the latter two outcomes is
$P_T = .04$, which is less than the one-tailed value $\alpha = .05$ (which represents the extreme 5% of
the sampling distribution in one of the two tails of the distribution). Since the data are consistent
with the directional alternative hypothesis $H_1: \pi_1 < \pi_2$, and $P_T = .04$ is less than $\alpha = .05$, the
latter alternative hypothesis is supported.

Table 16.9 Possible Outcomes for Observed Cell Frequencies
If All Marginal Sums Equal 6, When n = 12

             Row 1              Row 2
Outcome   Col. 1   Col. 2    Col. 1   Col. 2    Row sums   Column sums      P
   1         0        6         6        0        6, 6        6, 6        .001
   2         1        5         5        1        6, 6        6, 6        .039
   3         2        4         4        2        6, 6        6, 6        .243
   4         3        3         3        3        6, 6        6, 6        .433
   5         4        2         2        4        6, 6        6, 6        .243
   6         5        1         1        5        6, 6        6, 6        .039
   7         6        0         0        6        6, 6        6, 6        .001

If, however, the nondirectional alternative hypothesis $H_1: \pi_1 \neq \pi_2$ is employed, in addition
to considering Outcomes 1 and 2, the researcher must also consider Outcomes 6 and 7, which
are the analogous extreme outcomes in the opposite tail of the distribution. The latter set of
outcomes also has a combined probability of .04. If the probabilities associated with Outcomes 1
and 2 and Outcomes 6 and 7 are summed, the resulting value $P_T = .04 + .04 = .08$ represents
the likelihood in both tails of the distribution of obtaining an outcome equal to or more extreme
than the outcome observed in Table 16.7. Since the value $P_T = .08$ is greater than the two-tailed
value $\alpha = .05$, the nondirectional alternative hypothesis $H_1: \pi_1 \neq \pi_2$ is not supported. This is
commensurate with saying that in order for the latter alternative hypothesis to be supported at the
.05 level, in each of the tails of the distribution the maximum permissible probability value for
outcomes that are equivalent to or more extreme than the observed outcome cannot be greater
than the value .05/2 = .025. As noted earlier, since the computed probability in the relevant tail
of the distribution equals .04 (which is greater than .05/2 = .025), the nondirectional alternative
hypothesis is not supported.

If the directional alternative hypothesis $H_1: \pi_1 > \pi_2$ is employed, the researcher is
interested in Outcomes 6 and 7. As is the case for Outcomes 1 and 2, the combined probability for
these two outcomes is $P_T = .04$, which is less than the one-tailed value $\alpha = .05$ (which
represents the extreme 5% of the sampling distribution in the other tail of the distribution).
However, since the data are not consistent with the directional alternative hypothesis
$H_1: \pi_1 > \pi_2$, it is not supported.
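
The full set of outcomes in Table 16.9 can be generated by enumeration. The sketch below is illustrative and assumes only the marginal sums of Example 16.3. The constant and variable names are hypothetical. Equation 16.7 is rewritten in terms of the marginal sums, since a + c is the first column sum, b + d is the second column sum, and a + b is the first row sum.

```python
from math import comb

ROW_1_SUM, ROW_2_SUM, COL_1_SUM = 6, 6, 6
n = ROW_1_SUM + ROW_2_SUM

# Every value of cell a that keeps all four cell frequencies non-negative.
smallest_a = max(0, COL_1_SUM - ROW_2_SUM)
largest_a = min(ROW_1_SUM, COL_1_SUM)

probabilities = []
for a in range(smallest_a, largest_a + 1):
    b = ROW_1_SUM - a
    p = comb(COL_1_SUM, a) * comb(n - COL_1_SUM, b) / comb(n, ROW_1_SUM)
    probabilities.append(p)

for outcome_number, p in enumerate(probabilities, start=1):
    print(f"Outcome {outcome_number}: P = {p:.4f}")

one_tail = probabilities[0] + probabilities[1]       # Outcomes 1 and 2
other_tail = probabilities[-1] + probabilities[-2]   # Outcomes 6 and 7
two_tail = one_tail + other_tail

print(round(one_tail, 2))   # 0.04
print(round(two_tail, 2))   # 0.08
```

Because the marginal sums in this example are all equal, the distribution is symmetric and the two tails have identical probabilities. With unequal margins the tails generally differ.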

To compare the results of the Fisher exact test with those that will be obtained if Example
16.3 is evaluated with the chi-square test for r x c tables, Equation 16.2 will be employed
to evaluate the data in Table 16.7. Table 16.10 summarizes the chi-square analysis. Note that
the expected frequency of each cell is 3, since when Equation 16.1 is employed, the value
$E_{ij} = [(6)(6)]/12 = 3$ is computed for all r x c = 4 cells.

Table 16.10 Chi-Square Summary Table for Example 16.3

Cell                                             O_ij   E_ij   (O_ij - E_ij)   (O_ij - E_ij)^2   (O_ij - E_ij)^2 / E_ij
Cell 11: Noise/Helped the confederate              1      3         -2                4                  1.33
Cell 12: Noise/Did not help the confederate        5      3          2                4                  1.33
Cell 21: No noise/Helped the confederate           5      3          2                4                  1.33
Cell 22: No noise/Did not help the confederate     1      3         -2                4                  1.33
                                           ΣO_ij = 12   ΣE_ij = 12   Σ(O_ij - E_ij) = 0                χ² = 5.32


Since for df = 1, the obtained value $\chi^2 = 5.32$ is greater than the tabled critical two-tailed
.05 value $\chi^2_{.05} = 3.84$, the nondirectional alternative hypothesis $H_1: \pi_1 \neq \pi_2$ is supported at the
.05 level. It is not, however, supported at the .01 level, since $\chi^2 = 5.32$ is less than the tabled
critical two-tailed .01 value $\chi^2_{.01} = 6.63$.

The directional alternative hypothesis $H_1: \pi_1 < \pi_2$ is supported at the .05 level, since the
obtained value $\chi^2 = 5.32$ is greater than the tabled critical one-tailed .05 value $\chi^2_{.10} = 2.71$. The
latter directional alternative hypothesis falls just short of being supported at the .01 level, since
$\chi^2 = 5.32$ is less than the tabled critical one-tailed .01 value $\chi^2_{.02} = 5.43$.

Note that when the Fisher exact test is employed, the nondirectional alternative hypothesis
$H_1: \pi_1 \neq \pi_2$ is not supported at the .05 level, yet it is supported at the .05 level when the
chi-square test for r x c tables is used. Both the Fisher exact test and chi-square test for r x c
tables allow the researcher to reject the null hypothesis at the .05 level if the directional
alternative hypothesis $H_1: \pi_1 < \pi_2$ is employed. However, whereas the latter alternative hypothesis
falls just short of significance at the .01 level when the chi-square test for r x c tables is used,
it is further removed from being significant when the Fisher exact test is employed. The
discrepancy between the two tests when they are applied to the same set of data involving a small
sample size suggests that the chi-square approximation underestimates the actual probability
associated with the observed frequencies and, consequently, increases the likelihood of
committing a Type I error.

On the other hand, the result for the chi-square test for r x c tables for Example 16.3 will
be totally consistent with the result obtained with the Fisher exact test if Yates’ correction for
continuity is employed in evaluating the data. If Yates’ correction is employed, each of the
$(O_{ij} - E_{ij})$ values in Table 16.10 is replaced by $(|O_{ij} - E_{ij}| - .5) = 1.5$, which, when squared,
equals 2.25. When the latter value is divided by the expected cell frequency of 3, it yields .75,
which will be the entry for each cell in the last column of Table 16.10. The sum of the values in
the latter column is $\chi^2 = 3$, which is the continuity-corrected chi-square value. Since $\chi^2 = 3$ is
less than the tabled critical two-tailed values $\chi^2_{.05} = 3.84$ and $\chi^2_{.01} = 6.63$, the nondirectional
alternative hypothesis is not supported. Since $\chi^2 = 3$ is greater than the tabled critical one-tailed
.05 value $\chi^2_{.10} = 2.71$ but less than the tabled critical one-tailed .01 value $\chi^2_{.02} = 5.43$, the
directional alternative hypothesis that is consistent with the data ($H_1: \pi_1 < \pi_2$) is supported, but
only at the .05 level. This is the same conclusion that is reached when the Fisher exact test is employed.
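
The comparison among the three analyses of Example 16.3 can be made concrete by converting each chi-square value to a probability. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the quick computational form of the chi-square equation is used, and the conversion relies on the fact that for df = 1 a chi-square value is the square of a standard normal deviate.

```python
from math import comb, sqrt
from statistics import NormalDist


def chi_square_2x2(a, b, c, d, yates=False):
    """Quick computational chi-square for a 2 x 2 table, optionally Yates-corrected."""
    n = a + b + c + d
    term = abs(a * d - b * c) - (0.5 * n if yates else 0.0)
    return n * term ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def two_tailed_probability_df1(chi_square):
    """Right-tail area of the chi-square distribution with df = 1."""
    return 2 * (1 - NormalDist().cdf(sqrt(chi_square)))


uncorrected = chi_square_2x2(1, 5, 5, 1)
corrected = chi_square_2x2(1, 5, 5, 1, yates=True)
p_uncorrected = two_tailed_probability_df1(uncorrected)
p_corrected = two_tailed_probability_df1(corrected)

# Exact two-tailed probability: Outcomes 1, 2, 6, and 7 of Table 16.9.
p_fisher = 2 * (comb(6, 0) * comb(6, 6) + comb(6, 1) * comb(6, 5)) / comb(12, 6)

print(uncorrected, p_uncorrected)
print(corrected, p_corrected)
print(p_fisher)
```

The uncorrected statistic yields a two-tailed probability of roughly .02, whereas the continuity-corrected statistic and the Fisher exact test both yield approximately .08. Carried to more decimal places the uncorrected statistic is approximately 5.33; the value 5.32 in Table 16.10 results from summing rounded cell entries.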


5. Test 16d: The z test for two independent proportions The z test for two independent
proportions is an alternative large sample procedure for evaluating a 2 x 2 contingency table.
In point of fact, the z test for two independent proportions yields a result that is equivalent to
that obtained with the chi-square test for r x c tables. Later in this discussion it will be
demonstrated that if both the z test for two independent proportions (which is based on the 
normal distribution) and the chi-square test for r x c tables are applied to the same set of data, 
the square of the z value obtained for the former test will equal the chi-square value obtained for 
the latter test. 

The z test for two independent proportions is most commonly employed to evaluate the 
null and alternative hypotheses that are described for the Fisher exact test (which for a 2 x 2 
contingency table are equivalent to the null and alternative hypotheses presented in Section III 
for the chi-square test for homogeneity). Thus, in reference to Example 16.1, employing the 
model for a 2 x 2 contingency table summarized by Table 16.6, the null hypothesis and
nondirectional alternative hypothesis for the z test for two independent proportions are as follows.


$H_0$: $\pi_1 = \pi_2$


(In the underlying populations the samples represent, the proportion of observations in Row 1 
(the noise condition) that falls in Cell a is equal to the proportion of observations in Row 2 (the 
no noise condition) that falls in Cell c.) 


$H_1$: $\pi_1 \neq \pi_2$


(In the underlying populations the samples represent, the proportion of observations in Row 1 
(the noise condition) that falls in Cell a is not equal to the proportion of observations in Row 2 
(the no noise condition) that falls in Cell c.) 


An alternate but equivalent way of stating the above noted null hypothesis and
nondirectional alternative hypothesis is as follows: $H_0$: $\pi_1 - \pi_2 = 0$ versus $H_1$: $\pi_1 - \pi_2 \neq 0$.
As is the case with the Fisher exact test (as well as the chi-square test for r x c tables), the
alternative hypothesis can also be stated directionally. Thus, the following two directional
alternative hypotheses can be employed: $H_1$: $\pi_1 > \pi_2$ (which can also be stated as
$H_1$: $\pi_1 - \pi_2 > 0$) or $H_1$: $\pi_1 < \pi_2$ (which can also be stated as $H_1$: $\pi_1 - \pi_2 < 0$).

Equation 16.9 is employed to compute the test statistic for the z test for two independent 
proportions. 


\[
z = \frac{p_1 - p_2}{\sqrt{p(1 - p)\left[\frac{1}{n_1} + \frac{1}{n_2}\right]}} \qquad \text{(Equation 16.9)}
\]

Where: $n_1$ represents the number of observations in Row 1
$n_2$ represents the number of observations in Row 2
$p_1 = a/(a + b) = a/n_1$ represents the proportion of observations in Row 1 that falls
in Cell a. It is employed to estimate the population proportion $\pi_1$.
$p_2 = c/(c + d) = c/n_2$ represents the proportion of observations in Row 2 that falls
in Cell c. It is employed to estimate the population proportion $\pi_2$.
$p = (a + c)/(n_1 + n_2) = (a + c)/n$. $p$ is a pooled estimate of the proportion of
observations in Column 1 in the underlying population.


The denominator of Equation 16.9, which represents a standard deviation of a sampling
distribution of differences between proportions, is referred to as the standard error of the
difference between two proportions (which is often summarized with the notation $s_{p_1 - p_2}$). This
latter value is analogous to the standard error of the difference ($s_{\bar{X}_1 - \bar{X}_2}$), which is the
denominator of Equations 11.1, 11.2, 11.3, and 11.5 (which are all described in reference to the
t test for two independent samples (Test 11)). Whereas the latter value is a standard deviation
of a sampling distribution of difference scores between the means of two populations, the
standard error of the difference between two proportions is a standard deviation of a sampling
distribution of difference scores between proportions for two populations.

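For Example 16.1 we either know or can compute the following values:

\[
a = 30 \qquad c = 60 \qquad n_1 = 100 \qquad n_2 = 100
\]

\[
p_1 = \frac{a}{n_1} = \frac{30}{100} = .3 \qquad p_2 = \frac{c}{n_2} = \frac{60}{100} = .6
\]

\[
p = \frac{a + c}{n_1 + n_2} = \frac{30 + 60}{100 + 100} = .45
\]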

Employing the above values in Equation 16.9, the value z = -4.26 is computed. 


\[
z = \frac{.3 - .6}{\sqrt{(.45)(.55)\left[\frac{1}{100} + \frac{1}{100}\right]}} = -4.26
\]
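The computation for Equation 16.9 can be checked numerically. The following Python sketch is offered only as an illustration: the function name `z_two_proportions` and its argument names are assumptions introduced here, and the cell values b = 70 and d = 40 are derived from a = 30, c = 60, and the two row sums of 100.

```python
import math


def z_two_proportions(a, b, c, d):
    """Return z from Equation 16.9 for a 2 x 2 table with cells a, b (Row 1) and c, d (Row 2)."""
    n1 = a + b                      # number of observations in Row 1
    n2 = c + d                      # number of observations in Row 2
    p1 = a / n1                     # proportion of Row 1 that falls in Cell a
    p2 = c / n2                     # proportion of Row 2 that falls in Cell c
    p = (a + c) / (n1 + n2)         # pooled estimate of the Column 1 proportion
    standard_error = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# Example 16.1: a = 30 and c = 60 with n1 = n2 = 100, so b = 70 and d = 40.
z = z_two_proportions(30, 70, 60, 40)
print(f"z = {z:.2f}")
```

Passing the four cell frequencies rather than the proportions keeps the pooled estimate and the two sample proportions tied to the same table.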


The obtained value $z = -4.26$ is evaluated with Table A1 (Table of the Normal
Distribution) in the Appendix. In Table A1 the tabled critical two-tailed .05 and .01 values are
$z_{.05} = 1.96$ and $z_{.01} = 2.58$, and the tabled critical one-tailed .05 and .01 values are $z_{.05} = 1.65$
and $z_{.01} = 2.33$. The following guidelines are employed in evaluating the null hypothesis:

a) If the nondirectional alternative hypothesis $H_1$: $\pi_1 \neq \pi_2$ is employed, the null hypothesis
can be rejected if the obtained absolute value of z is equal to or greater than the tabled critical
two-tailed value at the prespecified level of significance.

b) If the directional alternative hypothesis $H_1$: $\pi_1 > \pi_2$ is employed, the null hypothesis
can be rejected if $p_1 > p_2$, and the obtained value of z is a positive number that is equal to or
greater than the tabled critical one-tailed value at the prespecified level of significance.

c) If the directional alternative hypothesis $H_1$: $\pi_1 < \pi_2$ is employed, the null hypothesis
can be rejected if $p_1 < p_2$, and the obtained value of z is a negative number with an absolute
value that is equal to or greater than the tabled critical one-tailed value at the prespecified level
of significance.

Employing the above guidelines, the following conclusions can be reached. 

Since the obtained absolute value $z = 4.26$ is greater than the tabled critical two-tailed
values $z_{.05} = 1.96$ and $z_{.01} = 2.58$, the nondirectional alternative hypothesis $H_1$: $\pi_1 \neq \pi_2$ is
supported at both the .05 and .01 levels.

Since the obtained value of z is a negative number and the absolute value $z = 4.26$ is greater
than the tabled critical one-tailed values $z_{.05} = 1.65$ and $z_{.01} = 2.33$, the directional alternative
hypothesis $H_1$: $\pi_1 < \pi_2$ is supported at both the .05 and .01 levels.

The directional alternative hypothesis $H_1$: $\pi_1 > \pi_2$ is not supported, since as previously
noted, in order for it to be supported the following must be true: $p_1 > p_2$.

Note that the above conclusions for the analysis of Example 16.1 (as well as Example 16.2)
with the z test for two independent proportions are consistent with the conclusions that are
reached when the chi-square test for r x c tables is employed to evaluate the same set of data.
In point of fact, the square of the z value obtained with Equation 16.9 will always be equal to the
value of chi-square computed with Equation 16.2. This relationship can be confirmed by the
fact that in the example under discussion, $(z = -4.26)^2 = (\chi^2 = 18.18)$. It is also the case
that the square of a tabled critical z value at a given level of significance will equal the tabled
critical chi-square value at the corresponding level of significance. This is confirmed for the
tabled critical two-tailed z and $\chi^2$ values at the .05 and .01 levels of significance:
$(z_{.05} = 1.96)^2 = (\chi^2_{.05} = 3.84)$ and $(z_{.01} = 2.58)^2 = (\chi^2_{.01} = 6.63)$.
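The equivalence between the squared z value and the chi-square value can be illustrated with a short Python sketch. It assumes that Equation 16.1 gives each expected frequency as (row sum)(column sum)/n and that Equation 16.2 sums the values (O - E)^2/E over the cells. The function names are hypothetical, and the cell values b = 70 and d = 40 follow from the row sums of 100.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2 x 2 table: sum of (O - E)^2 / E over the four cells."""
    observed = [[a, b], [c, d]]
    row_sums = [a + b, c + d]
    col_sums = [a + c, b + d]
    n = a + b + c + d
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def z_pooled(a, b, c, d):
    """z for two independent proportions with a pooled standard error (Equation 16.9)."""
    n1, n2 = a + b, c + d
    p = (a + c) / (n1 + n2)
    standard_error = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (a / n1 - c / n2) / standard_error


chi2 = chi_square_2x2(30, 70, 60, 40)
z = z_pooled(30, 70, 60, 40)
print(f"z squared = {z ** 2:.2f}, chi-square = {chi2:.2f}")
```

The two printed values agree to within floating-point error, which is the relationship described above for any 2 x 2 table.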

Yates’ correction for continuity can also be applied to the z test for two independent 
proportions. Equation 16.10 is the continuity-corrected equation for the z test for two inde- 
pendent proportions. 


\[
z = \frac{(p_1 - p_2) \mp \frac{1}{2}\left[\frac{1}{n_1} + \frac{1}{n_2}\right]}{\sqrt{p(1 - p)\left[\frac{1}{n_1} + \frac{1}{n_2}\right]}} \qquad \text{(Equation 16.10)}
\]

The following protocol is employed with respect to the numerator of Equation 16.10: a)
If $(p_1 - p_2)$ is a positive number, the term $[1/2][(1/n_1) + (1/n_2)]$ is subtracted from $(p_1 - p_2)$;
and b) If $(p_1 - p_2)$ is a negative number, the term $[1/2][(1/n_1) + (1/n_2)]$ is added to $(p_1 - p_2)$.
An alternative way of computing the value of the numerator of Equation 16.10 is to subtract
$[1/2][(1/n_1) + (1/n_2)]$ from the absolute value of $(p_1 - p_2)$, and then restore the original sign
of the latter value.














Employing Equation 16.10, the continuity-corrected value z = -4.12 is computed. 











\[
z = \frac{(.3 - .6) + \frac{1}{2}\left[\frac{1}{100} + \frac{1}{100}\right]}{\sqrt{(.45)(.55)\left[\frac{1}{100} + \frac{1}{100}\right]}} = -4.12
\]

Note that the absolute value of the continuity-corrected z value computed with Equation
16.10 will always be smaller than the absolute z value computed with Equation 16.9. As is the
case when Equation 16.9 is employed, the square of the continuity-corrected z value will equal
the continuity-corrected chi-square value computed with Equation 16.4. Thus, $(z = -4.12)^2
= (\chi^2 = 16.98)$.
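The continuity-corrected computation can be sketched in Python as follows. The sketch follows the alternative protocol described for the numerator (subtract the correction term from the absolute value of the difference, then restore the sign). The function name is an assumption made for illustration, and no special handling is included for the case in which the correction term exceeds the absolute difference, since the text does not address it.

```python
import math


def z_two_proportions_corrected(a, b, c, d):
    """Continuity-corrected z for two independent proportions (Equation 16.10)."""
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    p = (a + c) / (n1 + n2)                       # pooled estimate
    difference = p1 - p2
    correction = 0.5 * (1 / n1 + 1 / n2)
    sign = 1 if difference >= 0 else -1
    # Subtract the correction from the absolute difference, then restore the sign.
    numerator = sign * (abs(difference) - correction)
    standard_error = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return numerator / standard_error


# Example 16.1 (b = 70 and d = 40 follow from the row sums of 100).
z_corrected = z_two_proportions_corrected(30, 70, 60, 40)
print(f"continuity-corrected z = {z_corrected:.2f}")
```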

By employing Equation 16.10, the obtained absolute value of z is reduced to 4.12 (when
contrasted with the absolute value $z = 4.26$ computed with Equation 16.9). Since the absolute
value $z = 4.12$ is greater than both $z_{.05} = 1.96$ and $z_{.01} = 2.58$, the nondirectional alternative
hypothesis $H_1$: $\pi_1 \neq \pi_2$ is still supported at both the .05 and .01 levels. The directional
alternative hypothesis $H_1$: $\pi_1 < \pi_2$ is also still supported at both the .05 and .01 levels, since
the absolute value $z = 4.12$ is greater than both $z_{.05} = 1.65$ and $z_{.01} = 2.33$.

The protocol that has been described for the z test for two independent proportions
assumes that the researcher employs the null hypothesis $H_0$: $\pi_1 = \pi_2$. If, in fact, the null
hypothesis stipulates a difference other than zero between the two values $\pi_1$ and $\pi_2$, Equation
16.11 is employed to compute the test statistic for the z test for two independent proportions.

\[
z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}} \qquad \text{(Equation 16.11)}
\]

The use of Equation 16.11 is based on the assumption that $\pi_1 \neq \pi_2$. Whenever the latter
is the case, instead of computing a pooled estimate for a common population proportion, it is
appropriate to estimate a separate value for the proportion in each of the underlying populations
(i.e., $\pi_1$ and $\pi_2$) by using the values $p_1$ and $p_2$. This is in contrast to Equation 16.9, which
computes a pooled p value (that represents the pooled estimate of the proportion of observations
in Column 1 in the underlying populations). In the case of Equation 16.9, the computation of a
pooled p value is based on the assumption that $\pi_1 = \pi_2$.

To illustrate the application of Equation 16.11, let us assume that the following null
hypothesis and nondirectional alternative hypothesis are employed for Example 16.1: $H_0$: $\pi_1 - \pi_2
= -.1$ (which can also be written $H_0$: $\pi_2 - \pi_1 = .1$) versus $H_1$: $\pi_1 - \pi_2 \neq -.1$ (which can also
be written $H_1$: $\pi_2 - \pi_1 \neq .1$). The null hypothesis states that in the underlying populations
represented by the samples, the difference between the proportion of observations in Row 1 (the
noise condition) that falls in Cell a and the proportion of observations in Row 2 (the no noise
condition) that falls in Cell c is $-.1$. The alternative hypothesis states that in the underlying
populations represented by the samples, the difference between the proportion of observations
in Row 1 (the noise condition) that falls in Cell a and the proportion of observations in Row 2
(the no noise condition) that falls in Cell c is some value other than $-.1$.

In order for the nondirectional alternative hypothesis $H_1$: $\pi_1 - \pi_2 \neq -.1$ to be supported,
the obtained absolute value of z must be equal to or greater than the tabled critical two-tailed
value at the prespecified level of significance. The directional alternative hypothesis that is
consistent with the data is $H_1$: $\pi_1 - \pi_2 < -.1$. In order for the latter alternative hypothesis to
be supported, the sign of the obtained value of z must be negative, and the obtained absolute
value of z must be equal to or greater than the tabled one-tailed critical value at the prespecified
level of significance. The directional alternative hypothesis $H_1$: $\pi_1 - \pi_2 > -.1$ is not
consistent with the data. Either a significant positive z value (in which case $p_1 - p_2 > 0$) or
a significant negative z value (in which case $0 > (p_1 - p_2) > -.1$) is required to support the
latter alternative hypothesis.

Employing Equation 16.11 for the above analysis, the value $z = -2.99$ is computed.

\[
z = \frac{(.3 - .6) - (-.1)}{\sqrt{\frac{(.3)(.7)}{100} + \frac{(.6)(.4)}{100}}} = \frac{-.2}{.067} = -2.99
\]


Since the obtained absolute value $z = 2.99$ is greater than the tabled critical two-tailed
.05 and .01 values $z_{.05} = 1.96$ and $z_{.01} = 2.58$, the nondirectional alternative hypothesis
$H_1$: $\pi_1 - \pi_2 \neq -.1$ is supported at both the .05 and .01 levels. Thus, one can conclude that in
the underlying populations the difference $(\pi_1 - \pi_2)$ is some value other than $-.1$. The
directional alternative hypothesis $H_1$: $\pi_1 - \pi_2 < -.1$ is supported at both the .05 and .01
levels, since the obtained value $z = -2.99$ is a negative number and the absolute value $z = 2.99$
is greater than the tabled critical one-tailed values $z_{.05} = 1.65$ and $z_{.01} = 2.33$. Thus, if the
latter alternative hypothesis is employed, one can conclude that in the underlying populations the
difference $(\pi_1 - \pi_2)$ is some value that is less than $-.1$ (i.e., is a negative number with an
absolute value larger than .1). As noted earlier, the directional alternative hypothesis
$H_1$: $\pi_1 - \pi_2 > -.1$ is not supported, since it is not consistent with the data.
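A minimal Python sketch of Equation 16.11 is given below, assuming the unpooled standard error described for that equation. The function name and the parameter `null_difference` are hypothetical labels introduced here. Because the code carries the standard error at full precision (approximately .06708) rather than rounding it to .067, it returns a value of about -2.98 instead of the -2.99 obtained above with the rounded denominator.

```python
import math


def z_nonzero_null(p1, n1, p2, n2, null_difference):
    """z for two independent proportions when the null difference is not zero (Equation 16.11)."""
    # Separate estimates are used for each population, so no pooled p is computed.
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_value = ((p1 - p2) - null_difference) / standard_error
    return z_value, standard_error


# Example 16.1 with the null hypothesis that the difference between the proportions is -.1
z, standard_error = z_nonzero_null(0.3, 100, 0.6, 100, -0.1)
print(f"standard error = {standard_error:.3f}, z = {z:.2f}")
```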

Yates’ correction for continuity can also be applied to Equation 16.11 by employing the
same correction factor in the numerator that is employed in Equation 16.10. Using the correction
for continuity, the numerator of Equation 16.11 becomes:

\[
[(p_1 - p_2) - (\pi_1 - \pi_2)] \mp \frac{1}{2}\left[\frac{1}{n_1} + \frac{1}{n_2}\right]
\]

Without the correction for continuity, the value of the numerator of Equation 16.11 is $-.2$.
Since $[1/2][(1/100) + (1/100)] = .01$, using the guidelines outlined previously, the value of
the numerator becomes $-.2 + .01 = -.19$. When the latter value is divided by the denominator
(which for the example under discussion equals .067), it yields the continuity-corrected value
$z = -2.84$. Thus, by employing the correction for continuity, the absolute value of z is reduced
from $z = 2.99$ to $z = 2.84$. Since the absolute value $z = 2.84$ is greater than $z_{.05} = 1.96$ and
$z_{.01} = 2.58$, the nondirectional alternative hypothesis $H_1$: $\pi_1 - \pi_2 \neq -.1$ is still supported at
both the .05 and .01 levels. The directional alternative hypothesis $H_1$: $\pi_1 - \pi_2 < -.1$ is also
still supported at both the .05 and .01 levels, since the absolute value $z = 2.84$ is greater than
$z_{.05} = 1.65$ and $z_{.01} = 2.33$.

Cohen (1977, 1988) has developed a statistic called the h index that can be employed to
compute the power of the z test for two independent proportions. The value h is an effect size
index reflecting the difference between two population proportions. The concept of effect size
is discussed in Section VI of the single-sample t test (Test 2). It is discussed in greater detail
in Section VI of the t test for two independent samples, and in Section IX (the Addendum) of
the Pearson product-moment correlation coefficient under the discussion of meta-analysis
and related topics.

The equation for the h index is $h = \phi_1 - \phi_2$ (where $\phi_1$ and $\phi_2$ are the arcsine
transformed values for the proportions). Cohen (1977; 1988, Ch. 6) has derived tables that allow
a researcher through use of the h index to determine the appropriate sample size to employ if one
wants to test a hypothesis about the difference between two population proportions at a specified
level of power. Cohen (1977; 1988, pp. 184-185) has proposed the following (admittedly
arbitrary) h values as criteria for identifying the magnitude of an effect size: a) A small effect
size is one that is greater than .2 but not more than .5; b) A medium effect size is one that is
greater than .5 but not more than .8; and c) A large effect size is greater than .8.
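The h index can be illustrated with the sample proportions from Example 16.1. The following Python sketch assumes the arcsine transformation used by Cohen (1988), in which each proportion is transformed as twice the arcsine of its square root; that formula is not spelled out in the discussion above. The function names and the label returned for values of .2 or less are hypothetical.

```python
import math


def cohens_h(p1, p2):
    """Difference between two arcsine-transformed proportions."""
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return phi1 - phi2


def effect_size_label(h):
    """Classify the absolute value of h with the criteria quoted in the text."""
    magnitude = abs(h)
    if magnitude > 0.8:
        return "large"
    if magnitude > 0.5:
        return "medium"
    if magnitude > 0.2:
        return "small"
    return "below the small criterion"   # label assumed; the text gives no name for this range


h = cohens_h(0.3, 0.6)
label = effect_size_label(h)
print(f"h = {h:.3f} ({label})")
```

For the sample proportions .3 and .6 the absolute value of h is approximately .61, which falls in the medium range under the stated criteria.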


6. Computation of confidence interval for a difference between proportions With large
sample sizes, a confidence interval can be computed that identifies a range of values within
which one can be confident to a specified degree that the true difference between the two
population proportions $\pi_1$ and $\pi_2$ lies. Equation 16.12, which employs the normal distribution, is
the general equation for computing the confidence interval for the difference between two
population proportions. The notation employed in Equation 16.12 is identical to that used in the
discussion of the z test for two independent proportions.


\[
CI_{(1 - \alpha)} = (p_1 - p_2) \pm (z_{\alpha/2})(s_{p_1 - p_2}) \qquad \text{(Equation 16.12)}
\]

Where: $s_{p_1 - p_2} = \sqrt{[p_1(1 - p_1)]/n_1 + [p_2(1 - p_2)]/n_2}$
$z_{\alpha/2}$ represents the tabled critical two-tailed value in the normal distribution, below
which a proportion (percentage) equal to $[1 - (\alpha/2)]$ of the cases falls. If the
proportion (percentage) of the distribution that falls within the confidence interval is
subtracted from 1 (100%), it will equal the value of $\alpha$.


Employing Equation 16.12, the 95% confidence interval for Examples 16.1/16.2 is computed
below. In employing Equation 16.12, $(p_1 - p_2)$ (which is the numerator of Equation 16.9)
represents the obtained difference between the sample proportions, $z_{.05}$ represents the tabled
critical two-tailed .05 value $z_{.05} = 1.96$, and $s_{p_1 - p_2}$ (which is the denominator of Equation
16.11) represents the standard error of the difference between two proportions.








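\[
p_1 - p_2 = .3 - .6 = -.3
\]

\[
s_{p_1 - p_2} = \sqrt{\frac{(.3)(.7)}{100} + \frac{(.6)(.4)}{100}} = .067
\]

\[
CI_{.95} = (p_1 - p_2) \pm (z_{.05})(s_{p_1 - p_2}) = -.3 \pm (1.96)(.067) = -.3 \pm .131
\]

\[
-.169 \geq (\pi_1 - \pi_2) \geq -.431
\]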


This result indicates that the researcher can be 95% confident (or the probability is .95) that
the true difference between the population proportions falls within the range $-.431$ and $-.169$.
Specifically, it indicates that the researcher can be 95% confident that the proportion for
Population 1 ($\pi_1$) is less than the proportion for Population 2 ($\pi_2$) by a value that is greater than or
equal to .169, but not greater than .431. The result can also be written as $.169 \leq (\pi_2 - \pi_1)
\leq .431$, which indicates that the researcher can be 95% confident (or the probability is .95) that
the proportion for Population 2 ($\pi_2$) is larger than the proportion for Population 1 ($\pi_1$) by a
value that is greater than or equal to .169, but not greater than .431.

The 99% confidence interval, which results in a broader range of values, is computed below
employing the tabled critical two-tailed .01 value $z_{.01} = 2.58$ in Equation 16.12 in lieu of
$z_{.05} = 1.96$.

\[
CI_{.99} = (p_1 - p_2) \pm (z_{.01})(s_{p_1 - p_2}) = -.3 \pm (2.58)(.067) = -.3 \pm .173
\]

\[
-.127 \geq (\pi_1 - \pi_2) \geq -.473
\]


Thus, the researcher can be 99% confident (or the probability is .99) that the proportion for 
Population 1 is less than the proportion for Population 2 by a value that is greater than or equal 
to .127, but not greater than .473. 
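Both confidence intervals can be reproduced with a few lines of Python. The sketch below implements Equation 16.12 with the unpooled standard error; the function name `ci_difference` and its parameters are assumptions made for illustration, and the critical values 1.96 and 2.58 are supplied by hand rather than computed.

```python
import math


def ci_difference(p1, n1, p2, n2, z_critical):
    """Confidence interval for the difference between two population proportions (Equation 16.12)."""
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    difference = p1 - p2
    margin = z_critical * standard_error
    return difference - margin, difference + margin


lower95, upper95 = ci_difference(0.3, 100, 0.6, 100, 1.96)
lower99, upper99 = ci_difference(0.3, 100, 0.6, 100, 2.58)
print(f"95% CI: {lower95:.3f} to {upper95:.3f}")
print(f"99% CI: {lower99:.3f} to {upper99:.3f}")
```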


7. Test 16e: The median test for independent samples The model for the median test for 
independent samples assumes that there are k independent groups, and that within each group 
each observation is categorized with respect to whether it is above or below a composite median 
value. In actuality, the median test for independent samples is a label that is commonly used 
when the chi-square test for r x c tables, the z test for two independent proportions, or the 
Fisher exact test is employed to evaluate the hypothesis that in each of k groups there is an equal 
proportion of observations that are above versus below a composite median. With large sample 
sizes, the median test for independent samples is computationally identical to the chi-square 
test for r x c tables (when k > 2) and the z test for two independent proportions (when k = 2).
In the case of small sample sizes, the test is computationally identical to the Fisher exact test
(when k = 2). Table 16.6, which is used to summarize the model for the three aforementioned
tests, can also be applied to the median test for independent samples. The two rows are em- 
ployed to represent the two groups, and the two columns are used to represent the two categories 
on the dependent variable — specifically, whether a score falls above versus below the median. 
Example 16.4 will be employed to illustrate the median test for two independent samples. 


Example 16.4 A study is conducted to determine whether five-year old females are more likely
than five-year old males to score above the population median on a standardized test of eye-hand
coordination. One hundred randomly selected females and 100 randomly selected males are
administered the test of eye-hand coordination, and categorized with respect to whether they
score above or below the overall population median (i.e., the 50th percentile for both males and
females). Table 16.11 summarizes the results of the study. Do the data indicate that there are
gender differences in performance?


Table 16.11 Summary of Data for Example 16.4

               Above the median    Below the median    Row sums
Males                 30                  70              100
Females               60                  40              100
Column sums           90                 110              Total observations: 200


Since Example 16.4 involves a large sample size, it can be evaluated with Equation 16.2, 
the equation for the chi-square test for r x c tables (as well as with Equation 16.9, the equation 
for the z test for two independent proportions). The reader should take note of the fact that 
the study described in Example 16.4 conforms to the model for the chi-square test for 
homogeneity. This is the case, since the row sums (i.e., the number of males and females) are 
predetermined by the researcher. Since it is consistent with the model for the latter test, the null 
and alternative hypotheses that are evaluated with the median test for independent samples are 
identical to those evaluated with the chi-square test for homogeneity, the Fisher exact test, and
the z test for two independent proportions.

Since the data for Example 16.4 are identical to the data for Examples 16.1/16.2, analysis
of Example 16.4 with Equation 16.2 yields the value $\chi^2 = 18.18$ (which is the value obtained
for Examples 16.1/16.2). Since (as noted earlier) $\chi^2 = 18.18$ is significant at both the .05 and
.01 levels, one can conclude that females are more likely than males to score above the median.

In the case of the median test for independent samples, in the event that one or more 
subjects obtains a score that is equal to the population median, the following options are available 
for handling such scores: a) If the number of subjects who obtain a score that equals the median 
is reasonably large, a strong argument can be made for adding a third column to Table 16.11 for 
subjects who scored at the median. In such a case the contingency table is transformed into a 2 
x 3 table; b) If a minimal number of subjects obtain a score at the median, such subjects can be 
dropped from the data; and c) Within each group, half the scores that fall at the median value are 
assigned to the above the median category and the other half to the below the median category. 
In the final analysis, the critical concern in employing any of the aforementioned strategies
is to make sure that the procedure employed does not lead to
misleading conclusions regarding the distribution of scores in the underlying populations.
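The three options for handling scores that fall at the median can be sketched in Python. The helper `median_split` and the two small sets of scores below are hypothetical and serve only to illustrate the bookkeeping; they are not data from Example 16.4.

```python
def median_split(scores, median, ties="drop"):
    """Count scores above and below a composite median, handling ties by one of three options."""
    above = sum(1 for score in scores if score > median)
    below = sum(1 for score in scores if score < median)
    tied = sum(1 for score in scores if score == median)
    if ties == "column":      # option a: add a third column for scores at the median
        return [above, tied, below]
    if ties == "drop":        # option b: drop scores at the median from the data
        return [above, below]
    if ties == "split":       # option c: assign half of the tied scores to each category
        return [above + tied / 2, below + tied / 2]
    raise ValueError("ties must be 'column', 'drop', or 'split'")


# Hypothetical scores for two groups and a hypothetical composite median of 5
group_1 = [3, 5, 5, 7, 9]
group_2 = [1, 2, 5, 6, 8]
table_drop = [median_split(group_1, 5, "drop"), median_split(group_2, 5, "drop")]
table_split = [median_split(group_1, 5, "split"), median_split(group_2, 5, "split")]
table_column = [median_split(group_1, 5, "column"), median_split(group_2, 5, "column")]
print(table_drop, table_split, table_column)
```

Each resulting table has one row per group; with option a the table has three columns, so it would be evaluated as a 2 x 3 table.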

The median test for independent samples can be extended to designs involving more than 
two groups. As an example, in Example 16.4 instead of evaluating the number of males and 
females who score above versus below the median, four groups of children representing different 
ethnic groups (e.g., Caucasian, Asian-American, African-American, Native-American) could 
be evaluated with respect to whether they score above versus below the median. In such a case, 
the data are summarized in the form of a 4 x 2 contingency table, with the four rows representing 
the four ethnic groups and the two columns representing above versus below the median. 

It should be noted that some sources categorize the median test for independent samples 
as a test of ordinal data, since categorizing scores with respect to whether they are above versus 
below the median involves placing scores in one of two ordered categories. The reader should 
also be aware of the fact that it is possible to categorize scores on more than two ordered
categories. As an example, male and female children can be categorized with respect to whether they
score in the first quartile (lower 25%), second quartile (25% to 50%), third quartile (50% to
75%), or fourth quartile (upper 25%) on the test of eye-hand coordination. In such a case, the
data are summarized in the form of a 2 x 4 contingency table, with the two rows representing 
males and females and the four columns representing the four quartiles. It is also possible to 
have a design involving more than two groups of subjects (e.g., the four ethnic groups discussed 
above), and a dependent variable involving more than two ordered categories (e.g., four quar- 
tiles). The data for such a design are summarized in the form of a 4 x 4 contingency table. 
Finally, it is possible to have contingency tables in which both the row and the column variables 
are ordered. Although the chi-square test for r x c tables can be employed to evaluate a design 
in which both variables are ordered, alternative procedures may provide the researcher with more 
information about the relationship between the two variables. An example of such an alternative 
procedure is Goodman and Kruskal’s gamma (Test 32) discussed later in the book. 


8. Extension of the chi-square test for r x c tables to contingency tables involving more 
than two rows and/or columns, and associated comparison procedures It is noted in the 
previous section that the chi-square test for r x c tables can be employed with tables involving 
more than two rows and/or columns. In this section larger contingency tables will be discussed, 
and within the framework of the discussion additional analytical procedures that can be employed 
with such tables will be described. Example 16.5 will be employed to illustrate the use of the 
chi-square test for r x c tables with a larger contingency table — specifically, a 4 x 3 table. 


Example 16.5 A researcher conducts a study in order to determine if there are differences in 
the frequency of biting among different species of laboratory animals. He selects random 
samples of four laboratory species from the stock of various animal supply companies. Sixty 
mice, 50 gerbils, 90 hamsters, and 80 guinea pigs are employed in the study. Each of the 
animals is handled over a two-week period, and categorized into one of the following three 
categories with respect to biting behavior: not a biter, mild biter, flagrant biter. Table 16.12 
summarizes the data for the study. Do the data indicate that there are interspecies differences 
in biting behavior? 


Table 16.12 Summary of Data for Example 16.5

               Not a biter    Mild biter    Flagrant biter    Row sums
Mice                20            16              24              60
Gerbils             30            10              10              50
Hamsters            50            30              10              90
Guinea pigs         19            11              50              80
Column sums        119            67              94              Total observations: 280


The study described in Example 16.5 conforms to the model for the chi-square test for 
homogeneity. This is the case, since the row sums (i.e., the number of animals representing each
of the four species) are predetermined by the researcher. The row variable represents the
independent variable in the study. The independent variable, which is comprised of four levels, is
nonmanipulated, since it is based on a preexisting subject characteristic (i.e., species). The reader 
should take note of the fact that it is not necessary to have an equal number of subjects in each 
of the groups/categories that constitute the row variable. The column variable, which is com- 
prised of three categories, is the biting behavior of the animals. The latter variable represents the 
dependent variable in the study. Note that the marginal sums for the column variable are not 
predetermined by the researcher.

As is the case with a 2 x 2 contingency table, Equation 16.1 is employed to compute the
expected frequency for each cell. As an example, the expected frequency of Cell$_{11}$ (mice/not a
biter) is computed as follows: $E_{11} = [(O_{1.})(O_{.1})]/n = [(60)(119)]/280 = 25.5$. In employing
Equation 16.1, the value 60 represents the sum for Row 1 (which represents the total number of
mice), the value 119 represents the sum for Column 1 (which represents the total number of
animals categorized as not a biter), and 280 represents the total number of subjects/observations
in the study.

After employing Equation 16.1 to compute the expected frequency for each of the 4 x 3
= 12 cells in the contingency table, Equation 16.2 is employed to compute the value
$\chi^2 = 59.16$. The analysis is summarized in Table 16.13.


Table 16.13 Chi-Square Summary Table for Example 16.5

Cell                          O_ij     E_ij     (O_ij - E_ij)   (O_ij - E_ij)^2   (O_ij - E_ij)^2 / E_ij
Mice/Not a biter               20      25.50        -5.50            30.25               1.19
Mice/Mild biter                16      14.36         1.64             2.69                .19
Mice/Flagrant biter            24      20.14         3.86            14.90                .74
Gerbils/Not a biter            30      21.25         8.75            76.56               3.60
Gerbils/Mild biter             10      11.96        -1.96             3.84                .32
Gerbils/Flagrant biter         10      16.79        -6.79            46.10               2.75
Hamsters/Not a biter           50      38.25        11.75           138.06               3.61
Hamsters/Mild biter            30      21.54         8.46            71.57               3.32
Hamsters/Flagrant biter        10      30.21       -20.21           408.44              13.52
Guinea pigs/Not a biter        19      34.00       -15.00           225.00               6.62
Guinea pigs/Mild biter         11      19.14        -8.14            66.26               3.36
Guinea pigs/Flagrant biter     50      26.86        23.14           535.46              19.94
Column sums                   280     280.00         0                                  $\chi^2 = 59.16$


Substituting the values r = 4 and c = 3 in Equation 16.3, the number of degrees of freedom
for the analysis is df = (4 - 1)(3 - 1) = 6. Employing Table A4, the tabled critical .05 and .01
chi-square values for df = 6 are $\chi^2_{.05} = 12.59$ and $\chi^2_{.01} = 16.81$. Since the computed value
$\chi^2 = 59.16$ is greater than both of the aforementioned critical values, the null hypothesis can
be rejected at both the .05 and .01 levels. By virtue of rejecting the null hypothesis, the
researcher can conclude that the four species are not homogeneous with respect to biting
behavior, or to be more precise, that at least two of the species are not homogeneous.
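The analysis of the 4 x 3 table can be retraced with the Python sketch below, which computes each expected frequency as (row sum)(column sum)/n and sums the values (O - E)^2/E over all twelve cells. The function name is an assumption made for illustration. When the computation is carried out without intermediate rounding, the sum comes to approximately 59.27 rather than 59.16; the difference appears to stem from the guinea pigs/mild biter entry in Table 16.13 (66.26/19.14 is approximately 3.46, whereas the table lists 3.36). The conclusion regarding the null hypothesis is unaffected.

```python
def chi_square_rxc(observed):
    """Return the chi-square value and degrees of freedom for an r x c table of observed frequencies."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(column) for column in zip(*observed)]
    n = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, observed_frequency in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (observed_frequency - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


# Observed frequencies from Table 16.13 (columns: not a biter, mild biter, flagrant biter)
biting = [
    [20, 16, 24],   # mice
    [30, 10, 10],   # gerbils
    [50, 30, 10],   # hamsters
    [19, 11, 50],   # guinea pigs
]
chi2, df = chi_square_rxc(biting)
print(f"chi-square = {chi2:.2f}, df = {df}")
```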

In the case of a 2 x 2 contingency table, a significant result indicates that the two groups 
employed in the study are not homogeneous with respect to the dependent variable. However, 
in the case of a larger contingency table, although a significant result indicates that at least two 
of the r groups are not homogeneous, the chi-square analysis does not indicate which of the 
groups differ from one another or which of the cells are responsible for the significant effect. 
Visual inspection of Tables 16.12 and 16.13 suggests that the significant effect in Example 16.5 
is most likely attributable to the disproportionately large number of flagrant biters among 
guinea pigs, and the disproportionately small number of flagrant biters among hamsters. In 
lieu of visual inspection (which is not a precise method for identifying the cells that are primarily 
responsible for a significant effect), the following two types of comparisons are among those that 
can be conducted. 


Simple comparisons A simple comparison is a comparison between two of the r rows of an
r x c contingency table (or two of the c columns). Table 16.14 summarizes a simple comparison
that contrasts the biting behavior of mice with the biting behavior of guinea pigs. Note that in
the simple comparison, the data for the other two species employed in the study (i.e., gerbils and
hamsters) are not included in the analysis.


Table 16.14 Simple Comparison for Example 16.5

               Not a biter    Mild biter    Flagrant biter    Row sums
Mice                20            16              24              60
Guinea pigs         19            11              50              80
Column sums         39            27              74              Total observations: 140


It should be noted that a simple comparison does not have to involve all of the columns of 
the contingency table. Thus, one can compare the two species mice and guinea pigs, but limit 
the comparison within those species to only those animals who are classified not a biter and 
flagrant biter. The resulting 2 x 2 contingency table for such a comparison will differ from 
Table 16.14 in that the second column (mild biter) is not included. As a result of omitting the 
latter column, the row sum for mice is reduced to 44 and the row sum for guinea pigs is reduced 
to 69. The total number of observations for the comparison is 113. 


Complex comparisons A complex comparison is a comparison between two or more of the 
r rows of an r x c contingency table with one of the other rows or two or more of the other rows 
of the table. Table 16.15 summarizes an example of a complex comparison which contrasts the 
biting behavior of guinea pigs with the combined biting behavior of the other three species 
employed in the study (i.e., mice, gerbils, and hamsters). 

As is the case for a simple comparison, a complex comparison does not have to include all 
of the columns on which the groups are categorized. Thus, one can compare the mice, gerbils, 
and hamsters with guinea pigs, but limit the comparison to only those animals who are 
classified not a biter and flagrant biter. The resulting 2 x 2 contingency table for such a 
comparison will differ from Table 16.15, in that the second column (mild biter) will not be 
included. As a result of omitting the latter column, the row sum for mice, gerbils, and hamsters 
is reduced to 144 and the row sum for guinea pigs is reduced to 69. The total number of 
observations for the comparison is 213. 


Table 16.15 Complex Comparison for Example 16.5 





                   Not a biter   Mild biter   Flagrant biter   Row sums
Mice, Gerbils,
  and Hamsters         100           56             44            200
Guinea pigs             19           11             50             80
Column sums            119           67             94            280 (total observations)


The null and alternative hypotheses for simple and complex comparisons are identical to 
those that are employed for evaluating the original r x c table, except for the fact that they are 
stated in reference to the specific cells involved in the comparison. 
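
The construction of simple and complex comparison tables described above can be sketched in a few lines of Python. This is only an illustration: the function name `comparison_table` and the dictionary layout are assumptions made for the sketch, and the observed frequencies are the cell values listed for Example 16.5.

```python
# Observed frequencies for Example 16.5 (rows: species, columns: biting category).
COLUMNS = ["Not a biter", "Mild biter", "Flagrant biter"]
OBSERVED = {
    "Mice": [20, 16, 24],
    "Gerbils": [30, 10, 10],
    "Hamsters": [50, 30, 10],
    "Guinea pigs": [19, 11, 50],
}


def comparison_table(groups, keep_columns=None):
    """Return one row of pooled frequencies per group of species.

    groups       -- list of lists of species names; species in the same
                    inner list are pooled into a single row
    keep_columns -- column labels to retain (all columns when None)
    """
    keep = COLUMNS if keep_columns is None else keep_columns
    col_idx = [COLUMNS.index(label) for label in keep]
    return [
        [sum(OBSERVED[name][j] for name in names) for j in col_idx]
        for names in groups
    ]


two_columns = ["Not a biter", "Flagrant biter"]

# Simple comparison: mice versus guinea pigs (Table 16.14).
simple = comparison_table([["Mice"], ["Guinea pigs"]])
simple_2x2 = comparison_table([["Mice"], ["Guinea pigs"]], two_columns)

# Complex comparison: mice, gerbils, and hamsters pooled versus guinea pigs (Table 16.15).
pooled_species = ["Mice", "Gerbils", "Hamsters"]
complex_cmp = comparison_table([pooled_species, ["Guinea pigs"]])
complex_2x2 = comparison_table([pooled_species, ["Guinea pigs"]], two_columns)

for label, table in [("simple", simple), ("simple 2 x 2", simple_2x2),
                     ("complex", complex_cmp), ("complex 2 x 2", complex_2x2)]:
    row_sums = [sum(row) for row in table]
    print(label, table, "row sums:", row_sums, "n =", sum(row_sums))
```

The row sums printed for the two reduced 2 x 2 tables can be checked against the values quoted in the text (44 and 69 with 113 observations, and 144 and 69 with 213 observations).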

Sources are not in agreement with respect to what protocol is most appropriate to employ 
in conducting simple and/or complex comparisons following the computation of a chi-square 
value for an r x c contingency table. Among those sources that describe comparison procedures 
for r x c contingency tables are Keppel and Saufley (1980), Keppel et al. (1992), Marascuilo and 
McSweeney (1977), and Siegel and Castellan (1988). Keppel et al. (1992) note that Wickens 
(1989) states that, as a general rule, the different protocols which have been developed for con- 
ducting comparisons yield comparable results. In discussing the general issue of partitioning 
contingency tables (i.e., breaking down the table for the purposes of conducting comparisons), 
most sources emphasize that whenever feasible, a researcher should plan a limited number of 
simple and/or complex comparisons prior to the data collection phase of a study, and that any 
comparisons one conducts should be meaningful at a theoretical and/or practical level. 
In the discussion of comparisons in Section VI of the chi-square goodness-of-fit test, it is noted 
that when a limited number of comparisons are planned beforehand, most sources take the 
position that a researcher is not obliged to control the overall Type I error rate. However, when 
comparisons are not planned, there is general agreement that in order to avoid inflating the Type 
I error rate, the latter value should be adjusted. One way of adjusting the Type I error rate is to 
divide the maximum overall Type I error rate one is willing to tolerate by the total number of 
comparisons to be conducted. The resulting probability value can then be employed as the alpha 
level for each comparison that is conducted. To illustrate, if one intends to conduct three 
comparisons and does not want the overall Type I error rate for all of the comparisons to be 
greater than $\alpha = .05$, the alpha level to employ for each comparison is $\alpha$/number of comparisons 
= .05/3 = .0167. There are those who would argue the latter adjustment is too severe, since it 
substantially reduces the power associated with each comparison. In the final analysis, the 
researcher must be the one who decides what per comparison alpha level strikes an equitable 
balance in terms of the likelihood of committing a Type I error and the power associated with a 
comparison. Obviously, if a researcher employs a severely reduced alpha value, it may become 
all but impossible to detect actual differences that exist between the underlying populations. 
Before continuing this section, the reader may find it useful to review the discussion of com- 
parisons for the chi-square goodness-of-fit test. A thorough overview of the issues involved 
in conducting comparisons can be found in Section VI of the single-factor between-subjects 
analysis of variance (Test 21). In the remainder of this section, two comparison procedures for 
an r x c contingency table (based on Keppel et al. (1992) and Keppel and Saufley (1980)) will 
be described. The procedures to be described can be employed for both planned and unplanned 
comparisons. 
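
The adjustment of the per comparison alpha level can be illustrated with a short sketch. The function names are hypothetical, and the formula used for the overall Type I error rate, 1 - (1 - alpha)^k, assumes that the k comparisons are independent of one another, which is an assumption added here for illustration and is not stated in the discussion above.

```python
def per_comparison_alpha(overall_alpha, n_comparisons):
    """Divide the tolerated overall Type I error rate by the number of comparisons."""
    return overall_alpha / n_comparisons


def overall_type_i_rate(alpha_per_comparison, n_comparisons):
    """Probability of at least one Type I error, ASSUMING independent comparisons."""
    return 1 - (1 - alpha_per_comparison) ** n_comparisons


adjusted = per_comparison_alpha(0.05, 3)
print(round(adjusted, 4))                           # 0.0167
print(round(overall_type_i_rate(0.05, 3), 4))       # 0.1426 (no adjustment)
print(round(overall_type_i_rate(adjusted, 3), 4))   # 0.0492 (with adjustment)
```

Under the independence assumption, three unadjusted comparisons at .05 carry an overall rate of roughly .14, whereas the adjusted value .0167 keeps the overall rate just under .05.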


Method 1 The first procedure to be described (which is derived from Keppel et al. (1992)) 
employs for both simple and complex comparisons the same protocol to evaluate a comparison 
contingency table as the protocol that is employed when the complete r x c table is evaluated. 
In the case of Example 16.5, Equation 16.2 is employed to evaluate the simple comparison 
summarized by the 2 x 3 contingency table in Table 16.14. The analysis assumes there is a total 
of 140 observations. Equation 16.1 is employed to compute the expected frequency of each cell 
— i.e., the sum of the observations in the row in which the cell appears is multiplied by the sum 
of the observations in the column in which the cell appears, with the resulting product divided 
by the total number of observations in the 2 x 3 contingency table. As an example, the expected 
frequency of Cell$_{11}$ (mice/not a biter) is computed as follows: $E_{11} = [(O_{1.})(O_{.1})]/n$ 
$= [(60)(39)]/140 = 16.71$. In employing Equation 16.1, the value 60 represents the sum for 
Row 1 (which represents the total number of mice), the value 39 represents the sum for Column 
1 (which represents the total number of mice and guinea pigs categorized as not a biter), and 
140 represents the total number of observations in the table (i.e., mice and guinea pigs). 

The sums for the two rows/species involved in the simple comparison under discussion are 
identical to the row sums for those species in the original 4 x 3 contingency table. The column 
sums, however, only represent the sums of the columns for the two species involved in the 
comparison, and are thus different from the column sums for all four species that are computed when 
the original 4 x 3 contingency table is evaluated. Table 16.16 summarizes the computation of 
the value $\chi^2 = 7.39$ (with Equation 16.2) for the simple comparison summarized in Table 16.14. 


Table 16.16 Chi-Square Summary Table for Simple Comparison 
in Table 16.14 Employing Method 1 


Cell                          $O_{ij}$   $E_{ij}$   $(O_{ij} - E_{ij})$   $(O_{ij} - E_{ij})^2$   $(O_{ij} - E_{ij})^2 / E_{ij}$

Mice/Not a biter                 20       16.71          3.29                10.82                      .65
Mice/Mild biter                  16       11.57          4.43                19.62                     1.70
Mice/Flagrant biter              24       31.71         -7.71                59.44                     1.87
Guinea pigs/Not a biter          19       22.29         -3.29                10.82                      .49
Guinea pigs/Mild biter           11       15.43         -4.43                19.62                     1.27
Guinea pigs/Flagrant biter       50       42.29          7.71                59.44                     1.41
Column sums                     140      140.00             0                                $\chi^2 = 7.39$


Substituting the values r = 2 and c = 3 in Equation 16.3, the number of degrees of freedom 
for the comparison is df = (2 - 1)(3 - 1) = 2. Employing Table A4, the tabled critical .05 and 
.01 chi-square values for df = 2 are $\chi^2_{.05} = 5.99$ and $\chi^2_{.01} = 9.21$. Since the computed value 
$\chi^2 = 7.39$ is greater than $\chi^2_{.05} = 5.99$, the null hypothesis can be rejected at the .05 level. It 
cannot, however, be rejected at the .01 level, since $\chi^2 = 7.39$ is less than $\chi^2_{.01} = 9.21$. By virtue of 
rejecting the null hypothesis the researcher can conclude that mice and guinea pigs, the two 
species involved in the comparison, are not homogeneous with respect to biting behavior. 
Inspection of Table 16.14 reveals that the significant difference can primarily be attributed to the 
fact that there is a disproportionately large number of flagrant biters among guinea pigs. 
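
The Method 1 computation for the simple comparison can be reproduced with the following sketch, which applies the expected frequency rule of Equation 16.1, the chi-square sum of Equation 16.2, and the degrees of freedom rule of Equation 16.3 to the comparison table alone. The function name `chi_square_method1` is an assumption made for the illustration.

```python
def chi_square_method1(table):
    """Chi-square and df for a comparison table, using only its own margins."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n           # Equation 16.1
            chi_sq += (observed - expected) ** 2 / expected    # Equation 16.2
    df = (len(table) - 1) * (len(table[0]) - 1)                # Equation 16.3
    return chi_sq, df


# Table 16.14: mice (row 1) versus guinea pigs (row 2).
simple_comparison = [[20, 16, 24], [19, 11, 50]]
chi_sq, df = chi_square_method1(simple_comparison)
print(round(chi_sq, 2), df)
```

Because the sketch does not round the expected frequencies to two decimal places, its result (approximately 7.38) differs in the second decimal place from the value 7.39 obtained by summing the rounded entries in the summary table.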


Method 2 An alternative and somewhat more cumbersome method for conducting both simple 
and complex comparisons described by Bresnahan and Shapiro (1966) and Castellan (1965) is 
presented in Keppel and Saufley (1980). Although most statisticians would consider the method 
to be described in this section as preferable to Method 1, in most cases the latter method will 
result in similar conclusions with regard to the hypothesis under study. The method to be 
described in this section is identical to Method 1, except for the fact that in computing the 
chi-square value, for each cell in the comparison contingency table the value $(O_{ij} - E_{ij})^2$ is divided 
by the expected frequency computed for the cell when the original r x c contingency table is 
evaluated (instead of the expected frequency for the cell based on the row and column sums in 
the comparison contingency table). Specifically, for each of the cells in the comparison 
contingency table, the following values are computed: a) The expected frequency for each cell is 
computed as it is for Method 1 (i.e., using the row and column sums for the comparison 
contingency table and employing the number of observations in the comparison contingency 
table to represent the value of n); b) The values $(O_{ij} - E_{ij})$ and $(O_{ij} - E_{ij})^2$ are computed for 
each cell just as they are in Method 1; and c) In computing the values in the last column of the 
chi-square summary table, instead of dividing $(O_{ij} - E_{ij})^2$ by $E_{ij}$, it is divided by the expected 
frequency of the cell when all r x c cells in the original table are employed to compute the 
expected cell frequency. In the analysis to be described, the latter expected frequency for each 
cell is represented by the notation $E'_{ij}$. The steps noted above are illustrated in Table 16.17, 
which summarizes the application of Method 2 for computing the chi-square statistic for the 
simple comparison presented in Table 16.14. Note that the value of $E'_{ij}$ for each cell is the same 
as the value of the expected frequency for that cell in Table 16.13. 


Table 16.17 Chi-Square Summary Table for Simple Comparison 
in Table 16.14 Employing Method 2 


Cell                          $O_{ij}$   $E_{ij}$   $(O_{ij} - E_{ij})$   $(O_{ij} - E_{ij})^2$   $E'_{ij}$   $(O_{ij} - E_{ij})^2 / E'_{ij}$

Mice/Not a biter                 20       16.71          3.29                10.82            25.50             .42
Mice/Mild biter                  16       11.57          4.43                19.62            14.36            1.37
Mice/Flagrant biter              24       31.71         -7.71                59.44            20.14            2.95
Guinea pigs/Not a biter          19       22.29         -3.29                10.82            34.00             .32
Guinea pigs/Mild biter           11       15.43         -4.43                19.62            19.14            1.03
Guinea pigs/Flagrant biter       50       42.29          7.71                59.44            26.86            2.21
Column sums                     140      140.00             0                                         $\chi^2 = 8.30$


As is the case when Method 1 is employed, the null hypothesis can be rejected at the .05 
level but not at the .01 level, since for df = 2, the computed value $\chi^2 = 8.30$ is greater than 
$\chi^2_{.05} = 5.99$, but less than $\chi^2_{.01} = 9.21$. Although the computed value $\chi^2 = 8.30$ is slightly 
larger than the value $\chi^2 = 7.39$ computed with Method 1, the difference between the two 
chi-square values is minimal. As noted previously, the two methods will generally yield 
approximately the same value. 
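
A minimal sketch of Method 2 for a simple comparison is given below. It follows steps a) through c) as described above: the difference between observed and expected frequencies is based on the comparison table, while the divisor is the expected frequency of the same cell in the original 4 x 3 table. The function names and the list layout of the data are assumptions made for the illustration.

```python
# Observed frequencies for the original 4 x 3 table of Example 16.5.
FULL_TABLE = [
    [20, 16, 24],   # Mice
    [30, 10, 10],   # Gerbils
    [50, 30, 10],   # Hamsters
    [19, 11, 50],   # Guinea pigs
]


def expected_frequencies(table):
    """Expected frequency of each cell: (row sum)(column sum) / n."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    return [[rs * cs / n for cs in col_sums] for rs in row_sums]


def chi_square_method2(full_table, row_indices):
    """Method 2 chi-square for a simple comparison of the rows in row_indices."""
    comparison = [full_table[i] for i in row_indices]
    e_comparison = expected_frequencies(comparison)   # step a)
    e_original = expected_frequencies(full_table)     # divisor used in step c)
    chi_sq = 0.0
    for k, i in enumerate(row_indices):
        for j, observed in enumerate(comparison[k]):
            squared_diff = (observed - e_comparison[k][j]) ** 2   # step b)
            chi_sq += squared_diff / e_original[i][j]             # step c)
    return chi_sq


# Mice (row 0) versus guinea pigs (row 3), as in Table 16.14.
method2_value = chi_square_method2(FULL_TABLE, [0, 3])
print(round(method2_value, 2))
```

The sketch works from unrounded expected frequencies, so its result may differ slightly from a value obtained by summing rounded table entries.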

Method 1 and Method 2 will now be employed to evaluate the complex comparison 
summarized in Table 16.15. The results of these analyses are summarized in Tables 16.18 and 
16.19. 


Table 16.18 Chi-Square Summary Table for Complex Comparison 
in Table 16.15 Employing Method 1 


Cell                                      $O_{ij}$   $E_{ij}$   $(O_{ij} - E_{ij})$   $(O_{ij} - E_{ij})^2$   $(O_{ij} - E_{ij})^2 / E_{ij}$

Mice, Gerbils, Hamsters/Not a biter         100       85.00         15.00               225.00                     2.65
Mice, Gerbils, Hamsters/Mild biter           56       47.86          8.14                66.26                     1.38
Mice, Gerbils, Hamsters/Flagrant biter       44       67.14        -23.14               535.46                     7.98
Guinea pigs/Not a biter                      19       34.00        -15.00               225.00                     6.62
Guinea pigs/Mild biter                       11       19.14         -8.14                66.26                     3.46
Guinea pigs/Flagrant biter                   50       26.86         23.14               535.46                    19.94
Column sums                                 280      280.00             0                                $\chi^2 = 42.03$

Table 16.19 Chi-Square Summary Table for Complex Comparison 
in Table 16.15 Employing Method 2 


Cell                                      $O_{ij}$   $E_{ij}$   $(O_{ij} - E_{ij})$   $(O_{ij} - E_{ij})^2$   $E'_{ij}$   $(O_{ij} - E_{ij})^2 / E'_{ij}$

Mice, Gerbils, Hamsters/Not a biter         100       85.00         15.00               225.00           85.00            2.65
Mice, Gerbils, Hamsters/Mild biter           56       47.86          8.14                66.26           47.86            1.38
Mice, Gerbils, Hamsters/Flagrant biter       44       67.14        -23.14               535.46           67.14            7.98
Guinea pigs/Not a biter                      19       34.00        -15.00               225.00           34.00            6.62
Guinea pigs/Mild biter                       11       19.14         -8.14                66.26           19.14            3.46
Guinea pigs/Flagrant biter                   50       26.86         23.14               535.46           26.86           19.94
Column sums                                 280      280.00             0                                        $\chi^2 = 42.03$


Although in the case of the complex comparison, Method 1 and Method 2 both yield the 
value $\chi^2 = 42.03$, the two methods will not always yield the same value. Substituting the values 
r = 2 and c = 3 in Equation 16.3, the number of degrees of freedom for the comparison is 
df = (2 - 1)(3 - 1) = 2. Employing Table A4, the tabled critical .05 and .01 chi-square values 
for df = 2 are $\chi^2_{.05} = 5.99$ and $\chi^2_{.01} = 9.21$. Since the computed value $\chi^2 = 42.03$ is greater 
than both of the aforementioned critical values, the null hypothesis can be rejected at both the 
.05 and .01 levels. By virtue of rejecting the null hypothesis, the researcher can conclude that 
a combined population of mice, gerbils, and hamsters is not homogeneous with a population 
of guinea pigs with respect to biting behavior. Inspection of Table 16.15 reveals the latter can 
primarily be attributed to the discrepancy between the number of guinea pigs in the flagrant 
biters category and the number of mice, gerbils, and hamsters in the not a biter category. 
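
The agreement of the two methods for this particular complex comparison can be checked with the sketch below. It rests on one assumption that should be stated plainly: for the pooled row, the Method 2 divisor is taken to be the sum of the original-table expected frequencies of the pooled species, which reproduces the divisors shown in Table 16.19 (e.g., 25.50 + 21.25 + 38.25 = 85.00). The helper names are hypothetical.

```python
FULL_TABLE = [
    [20, 16, 24],   # Mice
    [30, 10, 10],   # Gerbils
    [50, 30, 10],   # Hamsters
    [19, 11, 50],   # Guinea pigs
]
POOLED_ROWS = (0, 1, 2)   # mice, gerbils, and hamsters
N_COLS = 3


def expected_frequencies(table):
    """Expected frequency of each cell: (row sum)(column sum) / n."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    return [[rs * cs / n for cs in col_sums] for rs in row_sums]


def chi_square(observed, expected, divisor):
    """Sum of (O - E)^2 / divisor over all cells."""
    return sum(
        (o - e) ** 2 / d
        for o_row, e_row, d_row in zip(observed, expected, divisor)
        for o, e, d in zip(o_row, e_row, d_row)
    )


# Comparison table (Table 16.15): pooled species versus guinea pigs.
comparison = [
    [sum(FULL_TABLE[i][j] for i in POOLED_ROWS) for j in range(N_COLS)],
    FULL_TABLE[3],
]
e_comparison = expected_frequencies(comparison)
e_original = expected_frequencies(FULL_TABLE)

# ASSUMED Method 2 divisors: pooled-row value is the sum over the pooled species.
e_prime = [
    [sum(e_original[i][j] for i in POOLED_ROWS) for j in range(N_COLS)],
    e_original[3],
]

method1 = chi_square(comparison, e_comparison, e_comparison)
method2 = chi_square(comparison, e_comparison, e_prime)
print(round(method1, 2), round(method2, 2))   # 42.03 42.03
```

The two values coincide here because the comparison uses every row of the original table, so the column sums and the total number of observations (and hence the expected frequencies) are the same in the comparison table and in the original table. A simple comparison drops rows, which changes the column sums and allows the two methods to diverge.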

As previously noted, if the two comparisons summarized in Tables 16.14 and 16.15 are not 
planned prior to collecting the data, most sources would argue that each comparison should be 
evaluated at a lower alpha level. If one does not want the likelihood of committing at least one 
Type I error in the set of two comparisons to be greater than .05, one can adjust the alpha level as 
follows: Adjusted $\alpha$ level = $\alpha$/number of comparisons = .05/2 = .025. Thus, in evaluating each 
of the comparisons, the tabled critical chi-square value at the .025 level is employed instead of the 
tabled critical .05 value $\chi^2_{.05} = 5.99$ (although as noted earlier some sources might consider the 
latter adjustment to be too severe). In Table A4, for the appropriate degrees of freedom, the 
tabled critical .025 value corresponds to the value listed under $\chi^2_{.975}$. In the case of df = 2, 
$\chi^2_{.975} = 7.38$. Note that for the simple comparison discussed in this section, the chi-square 
value $\chi^2 = 7.39$ obtained with Method 1 barely achieves significance if the latter critical 
value is used, whereas without the adjustment the latter result is significant at the .05 level by a 
comfortable margin. This example should serve to illustrate that by employing a lower alpha 
level, in addition to decreasing the likelihood of committing a Type I error, one also (by virtue of 
reducing the power of the test) decreases the likelihood of rejecting a false null hypothesis. 
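
The critical values quoted for df = 2 can be verified without a table, since for two degrees of freedom the upper-tail probability of the chi-square distribution has the closed form exp(-x/2), so that the critical value at level alpha is -2 ln(alpha). The sketch below uses that identity; it applies only to df = 2, and the function name is an assumption made for the illustration.

```python
import math


def chi_square_critical_df2(alpha):
    """Critical chi-square value for df = 2 only: solves exp(-x / 2) = alpha."""
    return -2.0 * math.log(alpha)


adjusted_alpha = 0.05 / 2                       # two unplanned comparisons
critical_05 = chi_square_critical_df2(0.05)
critical_01 = chi_square_critical_df2(0.01)
critical_adjusted = chi_square_critical_df2(adjusted_alpha)

print(round(critical_05, 2))         # 5.99
print(round(critical_01, 2))         # 9.21
print(round(critical_adjusted, 2))   # 7.38

# The Method 1 value for the simple comparison, as reported in the text.
print(7.39 >= critical_adjusted)     # True, but only by a narrow margin
```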

The rationale for presenting two comparison methods in this section is to demonstrate that 
although there is not a consensus among different sources with respect to what procedure should 
be employed for conducting comparisons, as a general rule, if a significant effect is present it will 
be identified regardless of which method one employs. Although Method 1 is the simpler of the 
two methods described in this section, as noted earlier, most sources would probably take the 
position that Method 1 is more subject to challenge on statistical grounds. Although in some 
instances there will be differences with respect to the precise probability values associated with 
the two methods (as well as the probabilities associated with other available methods), in most 
cases such differences will be trivial, and will thus be of little importance in terms of their 
practical and/or theoretical implications. In the final analysis, the per comparison alpha level one 
elects to employ (rather than the use of a different comparison procedure) is the most likely 
reason why two or more researchers may reach dramatically different conclusions with respect 
to a specific comparison. In such an instance, a replication study is the best available option for 
clarifying the status of the null hypothesis. Alternative procedures for conducting comparisons 
with contingency tables are described in Marascuilo and McSweeney (1977), Marascuilo and 
Serlin (1988), and Siegel and Castellan (1988). 


9. The analysis of standardized residuals As noted in Section VI of the chi-square goodness- 
of-fit test, an alternative procedure for conducting comparisons (developed by Haberman (1973) 
and cited in Siegel and Castellan (1988)) involves the computation of standardized residuals. 
By computing standardized residuals, one is able to determine which cells are the major 
contributors to a significant chi-square value. The computation of residuals can be useful in 
reinforcing or clarifying information derived from the comparison procedures described in the 
previous section, as well as providing additional information on the data contained in an r x c 
table. Through use of Equation 16.13, a standardized residual ($R_{ij}$) can be computed for each 
cell in an r x c contingency table. 


\[ R_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}} \]          (Equation 16.13)


The value computed for a residual with Equation 16.13, which is interpreted as a normally 
distributed variable, is evaluated with Table A1. Any residual with an absolute value that is 
equal to or greater than the tabled critical two-tailed .05 value $z_{.05} = 1.96$ is significant at the 
.05 level. Any residual with an absolute value that is equal to or greater than the tabled critical 
two-tailed .01 value $z_{.01} = 2.58$ is significant at the .01 level. Any cell in a contingency table 
which has a significant residual makes a significant contribution to the obtained chi-square value. 
For any cell that has a significant residual, one can conclude that the observed frequency of the 
cell differs significantly from its expected frequency. The sign of the standardized residual 
indicates whether the observed frequency of the cell is above (+) or below (-) the expected 
frequency. The sum of the squared residuals for all r x c cells will equal the obtained value of 
chi-square. The analysis of the residuals for Example 16.5 is summarized in Table 16.20. 


Table 16.20 Analysis of Residuals for Example 16.5 








Cell                          $O_{ij}$   $E_{ij}$   $(O_{ij} - E_{ij})$   $R_{ij} = (O_{ij} - E_{ij})/\sqrt{E_{ij}}$   $R_{ij}^2 = (O_{ij} - E_{ij})^2 / E_{ij}$

Mice/Not a biter                 20       25.50         -5.50               -1.09                  1.19
Mice/Mild biter                  16       14.36          1.64                 .43                   .19
Mice/Flagrant biter              24       20.14          3.86                 .86                   .74
Gerbils/Not a biter              30       21.25          8.75                1.90                  3.61
Gerbils/Mild biter               10       11.96         -1.96                -.57                   .33
Gerbils/Flagrant biter           10       16.79         -6.79               -1.66                  2.76
Hamsters/Not a biter             50       38.25         11.75                1.90                  3.61
Hamsters/Mild biter              30       21.54          8.46                1.82                  3.32
Hamsters/Flagrant biter          10       30.21        -20.21               -3.68**               13.52
Guinea pigs/Not a biter          19       34.00        -15.00               -2.57*                 6.62
Guinea pigs/Mild biter           11       19.14         -8.14               -1.86                  3.46
Guinea pigs/Flagrant biter       50       26.86         23.14                4.46**               19.93
Column sums                     280      280.00             0                          $\sum R_{ij}^2 = 59.28$


*Significant at the .05 level. 
**Significant at the .01 level. 


Inspection of Table 16.20 indicates that the residual computed for the following cells is 
significant: a) guinea pigs/flagrant biter: Since the absolute value $|R_{ij}| = 4.46$ computed 
for the residual is greater than $z_{.05} = 1.96$ and $z_{.01} = 2.58$, the residual is significant at both 
the .05 and .01 levels. The positive value of the residual indicates that the observed frequency 
of the cell is significantly above its expected frequency. The value of the residual for this cell 
is consistent with the fact that in the comparisons discussed in the previous section, the cell 
guinea pigs/flagrant biter appears to play a critical role in the significant effect that is detected; 
b) hamsters/flagrant biter: Since the absolute value $|R_{ij}| = 3.68$ computed for the residual is 
greater than $z_{.05} = 1.96$ and $z_{.01} = 2.58$, the residual is significant at both the .05 and .01 levels. 
The negative value of the residual indicates that the observed frequency of the cell is significantly 
below its expected frequency. The value of the residual for this cell is consistent with the fact 
that in the comparisons discussed in the previous section the cell hamsters/flagrant biter 
appears to play a critical role in the significant effect that is detected; and c) guinea pigs/not a 
biter: Since the absolute value $|R_{ij}| = 2.57$ computed for the residual is greater than 
$z_{.05} = 1.96$, but is less (albeit barely) than $z_{.01} = 2.58$, the residual is significant at the .05 level. 
The negative value of the residual indicates that the observed frequency of the cell is significantly 
below its expected frequency. 

Four other cells in Table 16.20 approach being significant at the .05 level (gerbils/not a 
biter, hamsters/not a biter, guinea pigs/mild biter, hamsters/mild biter). The absolute value 
of the residual for all of the aforementioned cells is close to the tabled critical value $z_{.05} = 1.96$. 
Note that in Table 16.20, the sum of the squared residuals is essentially equal to the chi-square 
value computed in Table 16.13 for Example 16.5 (the minimal discrepancy is the result of 
rounding off error). 
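
The residual analysis can be reproduced with the following sketch, which applies Equation 16.13 to every cell of the original 4 x 3 table and flags the cells whose absolute residual reaches the two-tailed .05 criterion of 1.96. The variable and function names are assumptions made for the illustration.

```python
import math

SPECIES = ["Mice", "Gerbils", "Hamsters", "Guinea pigs"]
CATEGORIES = ["Not a biter", "Mild biter", "Flagrant biter"]
OBSERVED = [
    [20, 16, 24],
    [30, 10, 10],
    [50, 30, 10],
    [19, 11, 50],
]


def standardized_residuals(table):
    """R_ij = (O_ij - E_ij) / sqrt(E_ij), with E_ij from the table margins."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            res_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(res_row)
    return residuals


residuals = standardized_residuals(OBSERVED)

# Cells significant at the two-tailed .05 level.
significant = [
    (SPECIES[i], CATEGORIES[j], round(r, 2))
    for i, row in enumerate(residuals)
    for j, r in enumerate(row)
    if abs(r) >= 1.96
]
for cell in significant:
    print(cell)

# The squared residuals sum to the chi-square value for the full table.
chi_sq = sum(r ** 2 for row in residuals for r in row)
print(round(chi_sq, 2))
```

Three cells are flagged (hamsters/flagrant biter, guinea pigs/not a biter, and guinea pigs/flagrant biter), in agreement with the discussion above. Individual residuals and their sum may differ from the tabled values in the second decimal place, since the sketch does not round intermediate results.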


10. Sources for computing the power of the chi-square test for r x c tables Cohen (1977, 
1988) has developed a statistic called the w index that can be employed to compute the power 
of the chi-square test for r x c tables (as well as the chi-square goodness-of-fit test). The 
value w is an effect size index reflecting the difference between expected and observed 
frequencies. The concept of effect size is discussed in Section VI of the single-sample t test. 
It is discussed in greater detail in Section VI of the t test for two independent samples, and in 
Section IX (the Addendum) of the Pearson product-moment correlation coefficient under the 
discussion of meta-analysis and related topics. 

The equation for the w index is $w = \sqrt{\sum [(P_{1i} - P_{0i})^2 / P_{0i}]}$, where $P_{0i}$ and $P_{1i}$ are the 
proportions of cases in the i-th cell under the null and alternative hypotheses, respectively. The 
latter equation indicates the following: a) For each of the cells in the chi-square table, the 
proportion of cases hypothesized in the null hypothesis is subtracted from the proportion of cases 
hypothesized in the alternative hypothesis; b) The obtained difference in each cell is squared, and 
then divided by the proportion hypothesized in the null hypothesis for that cell; c) All of the 
values obtained for the cells in part b) are summed; and d) w represents the square root of the 
sum obtained in part c). 

Cohen (1977, 1988, Ch. 7) has derived tables that allow a researcher to determine, through 
use of the w index, the appropriate sample size to employ if one wants to test a hypothesis about 
the difference between observed and expected frequencies in a chi-square table at a specified 
level of power. Cohen (1977, 1988, pp. 224-226) has proposed the following (admittedly 
arbitrary) w values as criteria for identifying the magnitude of an effect size: a) A small effect 
size is one that is greater than .1 but not more than .3; b) A medium effect size is one that is 
greater than .3 but not more than .5; and c) A large effect size is greater than .5. 
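
The steps that define the w index can be sketched as follows. For illustration only, the sketch treats the observed cell proportions of Example 16.5 as the proportions under the alternative hypothesis and the expected cell proportions as those under the null hypothesis; this substitution is an assumption made here (in planning a study the alternative proportions would be specified in advance), and the function names are hypothetical.

```python
import math

OBSERVED = [
    [20, 16, 24],   # Mice
    [30, 10, 10],   # Gerbils
    [50, 30, 10],   # Hamsters
    [19, 11, 50],   # Guinea pigs
]


def cohen_w(p_null, p_alt):
    """Square root of the sum over cells of (P1 - P0)^2 / P0."""
    return math.sqrt(sum((p1 - p0) ** 2 / p0 for p0, p1 in zip(p_null, p_alt)))


def effect_size_label(w):
    """Criteria quoted in the text: small (.1, .3], medium (.3, .5], large > .5."""
    if w > 0.5:
        return "large"
    if w > 0.3:
        return "medium"
    if w > 0.1:
        return "small"
    return "below the small criterion"


row_sums = [sum(row) for row in OBSERVED]
col_sums = [sum(col) for col in zip(*OBSERVED)]
n = sum(row_sums)

# ASSUMPTION: sample proportions stand in for the alternative hypothesis.
p_alt = [o / n for row in OBSERVED for o in row]
p_null = [(rs * cs / n) / n for rs in row_sums for cs in col_sums]

w = cohen_w(p_null, p_alt)
print(round(w, 2), effect_size_label(w))   # 0.46 medium
```

When sample proportions are used in this way, w reduces to the square root of the chi-square value divided by n, which offers a quick check on the computation.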





11. Heterogeneity chi-square analysis for a 2 x 2 contingency table Let us suppose that a 
researcher conducts m independent studies (where $m \ge 2$), each of which evaluates the same 
hypothesis. In order to illustrate the analysis to be described in this section, we will assume that 
the null hypothesis evaluated in each study is that a particular type of kidney disease is equally 
likely to occur in males and females. The data for each study are summarized in a 2 x 2 con- 
tingency table, and each of the contingency tables is evaluated with the chi-square test for 
homogeneity. Although none of the m analyses yield a statistically significant result, visual 
inspection of the data suggests to the researcher that the disease occurs more frequently in females 
than males. The researcher suspects that because of the relatively small sample size employed in 
each study, the absence of significant results may be due to a lack of statistical power. In order 
to increase the power of the analysis, the researcher wants to combine the data for the m studies 
into one 2 x 2 contingency table, and evaluate the latter table with the chi-square test for 
homogeneity. This section will present a procedure (described in Zar (1999, pp. 500-504)) for 
determining whether or not a researcher is justified in pooling data under such conditions. The 
procedure to be presented, which is referred to as heterogeneity chi-square analysis, was 
previously employed in reference to a one-dimensional chi-square table in Section VI of the chi- 
square goodness-of-fit test. 

The null and alternative hypotheses that are evaluated with a heterogeneity chi-square 
analysis are as follows. 


Null hypothesis $H_0$: The m samples are derived from the same population (i.e., 
population homogeneity). 


Alternative hypothesis $H_1$: At least two of the m samples are not derived from the same 
population (i.e., population heterogeneity). 


Table 16.21 summarizes the heterogeneity chi-square analysis conducted on m = 3 
2 x 2 contingency tables. Each table summarizes the results of a study evaluating the frequency 
of occurrence of the kidney disease in males and females. Part A of Table 16.21 presents the 
analysis of each study with the chi-square test for homogeneity. Column 2 for each of the 
studies contains the observed frequency for the relevant cell of the 2 x 2 contingency table. 
Thus, using the notation in Table 16.6, the following cells are represented in the contingency 
tables: Cell a: Males who develop the disease; Cell b: Males who do not develop the disease; 
Cell c: Females who develop the disease; Cell d: Females who do not develop the disease. 

The following protocol is employed in the heterogeneity chi-square analysis: a) Employing 
the chi-square test for homogeneity, a chi-square value is computed for each of the 
individual studies. Zar (1999) notes that even though each table has one degree of freedom, the 
correction for continuity should not be employed in analyzing the individual contingency tables 
at this point in the analysis; b) The sum of the m chi-square values obtained in a) for the 
individual studies is computed. The latter value itself represents a chi-square value, and will be 
designated $\chi^2_{sum}$. In addition, the sum of the degrees of freedom for the m studies is computed. 
The latter degrees of freedom value, which will be designated $df_{sum}$, is obtained by summing the 
df values for each of the individual studies. Since in the case of a 2 x 2 contingency table 
df = (r - 1)(c - 1) = 1, the value of $df_{sum}$ will equal the number of studies that are evaluated; 
c) The data for the m studies are combined into one table, and through use of the chi-square 
test for homogeneity a chi-square value, which will be designated $\chi^2_{pooled}$, is computed for the 
pooled data. The degrees of freedom for the table with the pooled data, which will be designated 
$df_{pooled}$, is equal to df = 1, since df for a 2 x 2 contingency table is df = (r - 1)(c - 1) = 1. Even 
though the table has one degree of freedom, the correction for continuity is not used in analyzing 
the table with the pooled data at this point in the analysis; d) The heterogeneity chi-square 
analysis is based on the premise that if the m samples are in fact homogeneous, the sum of the m 
individual chi-square values ($\chi^2_{sum}$) should be approximately the same value as the chi-square 
value computed for the pooled data ($\chi^2_{pooled}$). In order to determine the latter, the absolute value 
of the difference between the sum of the m chi-square values (obtained in b)) and the pooled 
chi-square value (obtained in c)) is computed. The obtained difference, which is itself a 
chi-square value, is the heterogeneity chi-square value, which will be designated $\chi^2_{het}$. Thus, 
$\chi^2_{het} = |\chi^2_{sum} - \chi^2_{pooled}|$. The null hypothesis will be rejected when there is a large difference 
between the values of $\chi^2_{sum}$ and $\chi^2_{pooled}$. The value $\chi^2_{het}$, which represents the test statistic, is 
evaluated with a degrees of freedom value that is the sum of the degrees of freedom for the m 
individual studies ($df_{sum}$) less the degrees of freedom obtained for the table with the pooled data 
($df_{pooled}$). Thus, $df_{het} = df_{sum} - df_{pooled}$. In order to reject the null hypothesis, the value $\chi^2_{het}$ 
must be equal to or greater than the tabled critical value at the prespecified level of significance 
for $df_{het}$; and e) If the null hypothesis is rejected the data cannot be pooled. If, however, the null 
hypothesis is retained, the data can be pooled, and the computed value for $\chi^2_{pooled}$ is employed 
to evaluate the goodness-of-fit hypothesis. Zar (1999) notes, however, that the table for the 
pooled data should be reevaluated employing the correction for continuity, and the 
continuity-corrected $\chi^2_{pooled}$ value (which will be a little lower than the original $\chi^2_{pooled}$ value) should be 
employed to evaluate the relevant hypothesis for the contingency table. 

The computed chi-square values for the three studies (summarized in Part A of Table 
16.21) are $\chi^2_1 = 1.69$, $\chi^2_2 = 2.88$, and $\chi^2_3 = 2.63$. The total number of degrees of freedom 
employed for the three studies is $df_{sum} = (3)(1) = 3$ (i.e., the number of studies (3) multiplied by 
the number of degrees of freedom per study (1)). Since a single 2 x 2 contingency table is 
evaluated for the pooled data, the degrees of freedom for the latter table is $df_{pooled} = 1$. By 
summing the chi-square values for the three studies, we compute the value $\chi^2_{sum} = 1.69 + 2.88 
+ 2.63 = 7.20$. Since in Part B of Table 16.21 we compute $\chi^2_{pooled} = 6.84$, the value for 
heterogeneity chi-square (computed in Part C of Table 16.21) is $\chi^2_{het} = |(\chi^2_{sum} = 7.20) - 
(\chi^2_{pooled} = 6.84)| = .36$. The degrees of freedom employed to evaluate the latter chi-square value 
are $df_{het} = (df_{sum} = 3) - (df_{pooled} = 1) = 2$. The tabled critical .05 and .01 chi-square values 
in Table A4 for df = 2 are $\chi^2_{.05} = 5.99$ and $\chi^2_{.01} = 9.21$. Since the computed value $\chi^2_{het} = .36$ 
is less than $\chi^2_{.05} = 5.99$, the null hypothesis is retained. In other words we can conclude the 
three samples are homogeneous (i.e., come from the same population), and thus we can justify 
pooling the data into a single table. 

As noted earlier, the correction for continuity is not employed in computing the value 
$\chi^2_{pooled} = 6.84$ for the pooled data in Part B of Table 16.21. The data for the pooled 
contingency table are now reevaluated with the chi-square test for homogeneity, employing the 
correction for continuity. The continuity-corrected chi-square value (obtained in Part D of 
Table 16.21) is $\chi^2_{pooled} = 6.02$. The degrees of freedom for the pooled contingency table are 
df = (r - 1)(c - 1) = 1. The tabled critical .05 and .01 values in Table A4 for df = 1 are 
$\chi^2_{.05} = 3.84$ and $\chi^2_{.01} = 6.63$. Since the value $\chi^2_{pooled} = 6.02$ is larger than $\chi^2_{.05} = 3.84$, the null 
hypothesis for the data in the contingency table can be rejected at the .05 level (but not at the .01 
level). In other words, with respect to the pooled data, we can conclude that in the case of at least 
one of the cells, there is a difference between its observed and expected frequency. Inspection 
of the data suggests that the disease occurs more frequently in females than it does in males. A 
more detailed discussion of the heterogeneity chi-square analysis can be found in Zar (1999). 
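
The protocol for the heterogeneity chi-square analysis can be sketched as follows for the three studies in Table 16.21. The function name and the tuple layout (a, b, c, d) are assumptions made for the illustration, and the continuity correction is implemented in the usual way, by subtracting .5 from the absolute value of each (observed - expected) difference before squaring.

```python
def chi_square_2x2(a, b, c, d, continuity=False):
    """Chi-square for a 2 x 2 table with rows (a, b) and (c, d)."""
    observed = ((a, b), (c, d))
    row_sums = (a + b, c + d)
    col_sums = (a + c, b + d)
    n = a + b + c + d
    chi_sq = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            diff = abs(observed[i][j] - expected)
            if continuity:
                diff = max(diff - 0.5, 0.0)
            chi_sq += diff ** 2 / expected
    return chi_sq


# (a) male/disease, (b) male/no disease, (c) female/disease, (d) female/no disease
studies = [(12, 17, 18, 13), (10, 16, 15, 9), (6, 12, 13, 9)]

# Steps a) and b): individual chi-square values (no continuity correction) and their sum.
chi_sum = sum(chi_square_2x2(*study) for study in studies)
df_sum = len(studies)                 # one degree of freedom per 2 x 2 table

# Step c): pool the cell frequencies and evaluate the pooled table.
pooled = tuple(sum(cells) for cells in zip(*studies))
chi_pooled = chi_square_2x2(*pooled)
df_pooled = 1

# Step d): heterogeneity chi-square and its degrees of freedom.
chi_het = abs(chi_sum - chi_pooled)
df_het = df_sum - df_pooled

# Step e): pool only if homogeneity is retained (tabled .05 value for df = 2 is 5.99).
can_pool = chi_het < 5.99
chi_pooled_corrected = chi_square_2x2(*pooled, continuity=True)

print(round(chi_sum, 2), round(chi_pooled, 2), round(chi_het, 2), df_het, can_pool)
print(round(chi_pooled_corrected, 2))
```

Since the sketch carries unrounded values through every step, its results can differ from the tabled values by a few hundredths; the conclusions (homogeneity retained, pooled result significant at the .05 level but not at the .01 level) are the same.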

It should be emphasized that a researcher should employ common sense in applying the 
heterogeneity chi-square analysis described in this section. There may be occasions when, even 
though the computed value of $\chi^2_{het}$ is not significant, it would not be recommended that the 
researcher pool the data from two or more smaller tables. To be more specific, one should not 
pool data from two or more tables employing small sample sizes (which, when evaluated 
individually, fail to yield a significant chi-square value) in order to obtain a significant pooled 
chi-square value, when there is an obvious inconsistency in the cell proportions for two or more 
of the tables. In other words, when the data from m tables are pooled, the proportion of cases in 
the cells of each of the m tables should be approximately the same. Everitt (1977, 1992) and 
Fleiss (1981) recommend alternative procedures for pooling the data from multiple chi-square 
tables. 

Table 16.21 Heterogeneity Chi-Square Analysis for Three 2 x 2 Contingency Tables

A. Chi-square analysis of three individual studies

Study 1
Cell                        O       E      (O - E)   (O - E)^2   (O - E)^2/E
(a) Male/Disease           12     14.5      -2.5       6.25         .43
(b) Male/No disease        17     14.5       2.5       6.25         .43
(c) Female/Disease         18     15.5       2.5       6.25         .40
(d) Female/No disease      13     15.5      -2.5       6.25         .40
Sums                       60     60         0                   $\chi^2_1 = 1.69$

Study 2
Cell                        O       E      (O - E)   (O - E)^2   (O - E)^2/E
(a) Male/Disease           10     13        -3          9           .69
(b) Male/No disease        16     13         3          9           .69
(c) Female/Disease         15     12         3          9           .75
(d) Female/No disease       9     12        -3          9           .75
Sums                       50     50         0                   $\chi^2_2 = 2.88$

Study 3
Cell                        O       E      (O - E)   (O - E)^2   (O - E)^2/E
(a) Male/Disease            6      8.55     -2.55      6.50         .76
(b) Male/No disease        12      9.45      2.55      6.50         .69
(c) Female/Disease         13     10.45      2.55      6.50         .62
(d) Female/No disease       9     11.55     -2.55      6.50         .56
Sums                       40     40         0                   $\chi^2_3 = 2.63$

Sum of chi-square values for three studies = $\chi^2_{sum}$ = 1.69 + 2.88 + 2.63 = 7.20

B. Chi-square analysis of pooled data

Pooled data for m = 3 studies
Cell                        O       E      (O - E)   (O - E)^2   (O - E)^2/E
(a) Male/Disease           28     36.01     -8.01     64.16        1.78
(b) Male/No disease        45     36.99      8.01     64.16        1.73
(c) Female/Disease         46     37.99      8.01     64.16        1.69
(d) Female/No disease      31     39.01     -8.01     64.16        1.64
Sums                      150    150         0                   $\chi^2_{pooled} = 6.84$

C. Heterogeneity chi-square analysis

Heterogeneity chi-square = Sum of chi-square values for three studies - Pooled chi-square value
$\chi^2_{het} = (\chi^2_{sum} = 7.20) - (\chi^2_{pooled} = 6.84) = .36$

D. Continuity-corrected chi-square analysis of pooled data

Pooled data for m = 3 studies (continuity-corrected)
Cell                        O       E     |O - E| - .5   (|O - E| - .5)^2   (|O - E| - .5)^2/E
(a) Male/Disease           28     36.01      7.51            56.40               1.57
(b) Male/No disease        45     36.99      7.51            56.40               1.52
(c) Female/Disease         46     37.99      7.51            56.40               1.48
(d) Female/No disease      31     39.01      7.51            56.40               1.45
Sums                      150    150                                        $\chi^2_{pooled} = 6.02$

Another procedure that can be employed to combine the information in a set of 2 x 2
contingency tables is the Mantel-Haenszel method (Mantel and Haenszel (1959)). Pagano and
Gauvreau (1993), who provide a detailed description of the procedure, note that the
Mantel-Haenszel method allows a researcher to do the following with a set of m 2 x 2 contingency tables:
a) Evaluate whether the tables are homogeneous; b) Compute a summary odds ratio (which is
discussed in the next section) for the tables; c) Compute a confidence interval for the odds ratio;
and d) Test the general hypothesis that each table was designed to evaluate (Pagano and
Gauvreau (1993, p. 354)). Other sources that discuss the Mantel-Haenszel method are Conover
(1999), Hollander and Wolfe (1999), Rosner (1995), and Sprent (1993). Meta-analysis (which
is discussed in Section IX (the Addendum) of the Pearson product-moment correlation
coefficient) can also be employed to obtain a combined result for multiple 2 x 2 contingency
tables that evaluate the same general hypothesis (although meta-analysis does not pool all of the
data into a single 2 x 2 table).


12. Measures of association for r x c contingency tables  Prior to reading this section the
reader may find it useful to review the discussion of magnitude of treatment effect (which is
a measure of the degree of association between two or more variables) in Section VI of the t test
for two independent samples. The computed chi-square value for an r x c contingency table
does not provide a researcher with precise information regarding the size of the treatment effect
present in the data. A chi-square value computed for a contingency table is not
an accurate index of the degree of association between the two variables because the chi-square
value is a function of both the total sample size and the proportion of observations in each of the
r x c cells of the table. The degree of association, on the other hand, is independent of the total
sample size, and is only a function of the cell proportions. In point of fact, when the cell
proportions are held constant, the magnitude of the computed chi-square statistic is directly
proportional to the total sample size employed in a study.

To illustrate the latter point, consider Table 16.22 which presents a different set of data for 
Example 16.1. In actuality, the effect size for the data presented in Table 16.22 is the same as 
the effect size in Table 16.2, and the only difference between the two tables is that in Table 16.22 
the total sample size and the number of observations in each of the cells are one-half of the 
corresponding values listed in Table 16.2. If Equation 16.2 is applied to the data presented in
Table 16.22, the value $\chi^2 = 9.1$ is computed. Note that the latter value is one-half of
$\chi^2 = 18.18$, which is the value computed for Table 16.2. Thus, even though the same
proportion of observations appears in the corresponding cells of the two tables, the chi-square 
value computed for each table is directly proportional to the total sample size. 
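
The proportionality can be verified numerically. The Python sketch below is illustrative only (the function name is an assumption, not the Handbook's notation); it computes the Pearson chi-square for the frequencies of Table 16.2 (30, 70, 60, 40) and for the halved frequencies of Table 16.22.

```python
def chi_square(table):
    """Pearson chi-square for an r x c table given as a list of rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            total += (observed - expected) ** 2 / expected
    return total

table_16_2 = [[30, 70], [60, 40]]     # Noise / No noise by Helped / Did not help
table_16_22 = [[15, 35], [30, 20]]    # same cell proportions, half the sample size

chi_full = chi_square(table_16_2)
chi_half = chi_square(table_16_22)

# With identical cell proportions, halving n halves the chi-square value
print(round(chi_full, 2), round(chi_half, 2), round(chi_full / chi_half, 2))
```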


Table 16.22 Summary of Data for Example 16.1 With Reduced Sample Size 


                 Helped             Did not help       Row sums
                 the confederate    the confederate
Noise                 15                 35                50
No noise              30                 20                50
Column sums           45                 55               100 (Total observations)


A number of different measures of association/correlation that are independent of sample 
size can be employed as indices of the magnitude of a treatment effect for an r x c contingency 
table. In this section the following measures of association will be described: a) Test 16f: The 
contingency coefficient; b) Test 16g: The phi coefficient; c) Test 16h: Cramér’s phi coef- 
ficient; d) Test 16i: Yule’s Q; and e) Test 16j: The odds ratio. 

As a general rule (although there are some exceptions), the value computed for a measure
of association/correlation will usually fall within a range of values between 0 and +1 or between
-1 and +1. Whereas a value of 0 indicates no relationship between the two variables, an absolute
value of 1 indicates a maximum relationship between the variables. Consequently, the closer the
absolute value of a measure of association is to 1, the stronger the relationship between the
variables. As noted above, some measures of association can assume values in the range between
-1 and +1. In such cases, the absolute value of the measure indicates the strength of the
relationship, and the sign of the measure indicates the direction of the relationship (i.e., the
pattern of observations among the cells of a contingency table). Before continuing this section
the reader may find it useful to read Section I of the Pearson product-moment correlation
coefficient, which provides a general discussion of correlation/association.

Measures of association for r x c contingency tables can be evaluated with respect to
statistical significance. In the case of a number of the measures of association to be discussed
in this section, if the computed chi-square value for the contingency table is statistically
significant at a given level of significance, the measure of association computed for the contingency
table will be significant at the same level of significance. In most instances, the null hypothesis
and nondirectional alternative hypothesis evaluated with reference to a measure of association
are as follows:


Null hypothesis The correlation/degree of association between the two variables in the under- 
lying population is zero. 


Alternative hypothesis The correlation/degree of association between the two variables in the 
underlying population is some value other than zero. 


The same guidelines discussed earlier in reference to employing a directional alternative 
hypothesis for the chi-square test for r x c tables can also be applied if one wants to state the 
alternative hypothesis for a measure of association directionally. 

Since the different measures of association that can be computed for a contingency table 
do not employ the same criteria in measuring the strength of the relationship between the two 
variables, if two or more measures are applied to the same set of data they may not yield 
comparable coefficients of association. Although in the material to follow, there will be some 
discussion of factors that can be taken into account in considering which of the various measures 
of association to employ, in most cases one measure is not necessarily superior to another in 
terms of providing information about a contingency table. Indeed, Conover (1980, 1999) notes 
that the choice of which measure to employ is based more on prevailing tradition than it is on 
statistical considerations. 


Test 16f: The contingency coefficient (C)  The contingency coefficient (also known as
Pearson's contingency coefficient) is a measure of association that can be computed for an
r x c contingency table of any size. The value of the contingency coefficient, which will be
represented with the notation C, is computed with Equation 16.14.

$$C = \sqrt{\frac{\chi^2}{\chi^2 + n}}$$   (Equation 16.14)

Where: $\chi^2$ is the computed chi-square value for the contingency table
       n is the total number of observations in the contingency table

Since n can never equal zero, the value of C can never equal 1. Consequently, the range
of values C may assume is $0 \le C < +1$. One limitation of the contingency coefficient is that its
upper limit (i.e., the highest value it can attain) is a function of the number of rows and columns
in the r x c contingency table. The upper limit of C (represented by $C_{max}$) can be determined
with Equation 16.15.

$$C_{max} = \sqrt{\frac{k - 1}{k}}$$   (Equation 16.15)

Where: k represents the smaller of the two values of r and c in the contingency table

Employing Equation 16.15 for a 2 x 2 contingency table, we can determine that the
maximum value for such a table is $C_{max} = \sqrt{(2 - 1)/2} = .71$.


Employing Equation 16.14, the value C = .29 is computed as the value of the contingency
coefficient for Examples 16.1/16.2.

$$C = \sqrt{\frac{18.18}{18.18 + 200}} = .29$$

Note that the value C = .29 is also obtained for Table 16.22 (which as noted earlier has the
same effect size but half the sample size as Tables 16.2/16.3): $C = \sqrt{(9.1)/(9.1 + 100)} = .29$.

As noted in the introductory remarks on measures of association, the computed value
C = .29 will be statistically significant at both the .05 and .01 levels, since the computed value
$\chi^2 = 18.18$ for Tables 16.2/16.3 (as well as the computed value $\chi^2 = 9.1$ for Table 16.22) is
significant at the aforementioned levels of significance. Thus, one can conclude that in the 
underlying population the contingency coefficient between the two variables is some value other 
than zero. 

The reader should take note of the fact that if the value of n is reduced, but the effect size
present in Tables 16.1/16.2/16.22 is maintained, at some point the chi-square value will not
achieve significance. Although in such a case Equation 16.14 will yield the value C = .29, the
latter value will not be statistically significant. This is the case, since in order for the value of 
C to be significant, the computed value of chi-square must be significant. 

Ott et al. (1992) note that among the disadvantages associated with the contingency 
coefficient is that it will always be less than 1, even when the two variables are totally dependent 
on one another. In addition, contingency coefficients that have been computed for two or more 
tables can only be compared with one another if all of the tables have the same number of rows 
and columns. One suggestion that is made to counteract the latter problems is to employ 
Equation 16.16 to compute an adjusted value for the contingency coefficient. 





$$C_{adj} = \frac{C}{C_{max}}$$   (Equation 16.16)

By employing Equation 16.16, if a perfect association between the variables exists, the
value of $C_{adj}$ will equal 1. If Equation 16.16 is employed for Examples 16.1/16.2 and Table
16.22, the value $C_{adj} = .29/.71 = .41$ is computed. It should be pointed out that although the use
of $C_{adj}$ allows for better comparison between tables of unequal size, it still does not allow one to
compare such tables with complete accuracy.
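
Equations 16.14 through 16.16 translate directly into code. The following Python sketch is a minimal illustration; the function names are assumptions introduced here, and the inputs are the chi-square values and sample sizes quoted in the text.

```python
import math

def contingency_coefficient(chi_sq, n):
    """Equation 16.14: C = sqrt(chi-square / (chi-square + n))."""
    return math.sqrt(chi_sq / (chi_sq + n))

def max_contingency_coefficient(r, c):
    """Equation 16.15: C_max = sqrt((k - 1) / k), where k is the smaller of r and c."""
    k = min(r, c)
    return math.sqrt((k - 1) / k)

c_full = contingency_coefficient(18.18, 200)   # Examples 16.1/16.2
c_half = contingency_coefficient(9.1, 100)     # reduced sample size, same cell proportions
c_max = max_contingency_coefficient(2, 2)
c_adj = c_full / c_max                         # Equation 16.16

print(round(c_full, 2), round(c_half, 2), round(c_max, 2), round(c_adj, 2))
```

Both sample sizes give the same coefficient after rounding, which is the sense in which C is independent of n.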

In the section that discusses the computation of the power of the chi-square test for r x c
tables, it is noted that Cohen (1977, 1988) has developed a measure of effect size for the chi-square
test for r x c tables called the w index. It is also noted that Cohen (1977, 1988; pp. 224-226) has
proposed the following (admittedly arbitrary) w values as criteria for identifying the magnitude
of an effect size: a) A small effect size is one that is greater than .1 but not more than .3; b) A
medium effect size is one that is greater than .3 but not more than .5; and c) A large effect size
is greater than .5.

Cohen (1977; 1988, p. 222) states that the value computed for a contingency coefficient
can be converted into the w index through use of the following equation: $w = \sqrt{C^2/(1 - C^2)}$.
When the computed value C = .29 is substituted in the latter equation, the resulting value is
$w = \sqrt{(.29)^2/[1 - (.29)^2]} = .30$. Employing Cohen's (1977, 1988) criteria for the w index,
the value .30 is at the lower limit for a medium effect size. Since the maximum value the
contingency coefficient can attain is 1, and in the case of a 2 x 2 contingency table the maximum
value C can attain is .71, one can argue that in lieu of C, $C_{adj}$ should be employed in computing
w. If $C_{adj} = .41$ is employed to compute w, the value $w = \sqrt{(.41)^2/[1 - (.41)^2]} = .45$ is
obtained, which based on Cohen's (1977, 1988) criteria still falls within the limits of a medium
effect size.
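
The conversion from a contingency coefficient to the w index, together with the cutoffs stated above, can be sketched as follows. The function names and the label returned for values of .1 or less are assumptions made for illustration; they are not Cohen's terminology.

```python
import math

def w_from_contingency_coefficient(c):
    """Convert a contingency coefficient to the w index: w = sqrt(C^2 / (1 - C^2))."""
    return math.sqrt(c ** 2 / (1 - c ** 2))

def cohen_label(w):
    """Classify w with the cutoffs stated in the text (greater than .1, .3, and .5)."""
    if w > 0.5:
        return "large"
    if w > 0.3:
        return "medium"
    if w > 0.1:
        return "small"
    return "below small"   # hypothetical label for values of .1 or less

w_c = w_from_contingency_coefficient(0.29)     # from C = .29
w_adj = w_from_contingency_coefficient(0.41)   # from the adjusted value .41

print(round(w_c, 2), cohen_label(w_c), round(w_adj, 2), cohen_label(w_adj))
```

Note that the unrounded value obtained from C = .29 lies just above .30, which is why it is treated as falling at the lower limit of the medium range.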








Test 16g: The phi coefficient  The phi coefficient (represented by the notation $\phi$, which is the
lower case Greek letter phi) is a measure of association that can only be employed with a 2 x 2
contingency table. The phi coefficient (which is discussed further in Section VII of the Pearson
product-moment correlation coefficient) is, in actuality, a special case of the latter correlation
coefficient. Specifically, it is a Pearson product-moment correlation coefficient that is computed
if the values 0 and 1 are employed to represent the levels of two dichotomous variables.
Although the value of phi can fall within the range -1 to +1, the latter statement must be
qualified, since the lower and upper limits of phi are dependent on certain conditions. Carroll
(1961) and Guilford (1965) note that in order for phi to equal -1 or +1, the following two
conditions must be met with respect to the 2 x 2 contingency table described by the model in
Table 16.6: (a + b) = (c + d) and (a + c) = (b + d).

Employing the notation in Table 16.6 for a 2 x 2 contingency table, the value of phi is
computed with Equation 16.17.

$$\phi = \frac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}}$$   (Equation 16.17)

Since $\phi^2 = \chi^2/n$, many sources compute the value of phi through use of Equation 16.18
(which can be derived from the equation $\phi^2 = \chi^2/n$). Note that the result of Equation 16.18 will
always be the absolute value of phi derived with Equation 16.17.

$$\phi = \sqrt{\frac{\chi^2}{n}}$$   (Equation 16.18)

Employing Equation 16.17, the value $\phi = -.30$ is computed below for Examples 16.1/16.2
(it will yield the same value for the data in Table 16.22).

$$\phi = \frac{(30)(40) - (70)(60)}{\sqrt{(30 + 70)(60 + 40)(30 + 60)(70 + 40)}} = -.30$$





If Equation 16.18 is employed, the absolute value of phi is computed to be $\phi = \sqrt{18.18/200}
= .30$. As noted in the introductory remarks on measures of association, the computed absolute
value $\phi = .30$ will be statistically significant at both the .05 and .01 levels, since the computed
value $\chi^2 = 18.18$ is significant at the aforementioned levels of significance. Thus, one can
conclude that in the underlying population the phi coefficient between the two variables is some
value other than zero. In addition to the nondirectional alternative hypothesis being supported,
the directional alternative hypothesis that is consistent with the data is also supported. This is the
case since the computed value of chi-square is significant at the .10 and .02 levels (i.e., the
obtained value $\chi^2 = 18.18$ is greater than $\chi^2_{.10} = 2.71$ and $\chi^2_{.02} = 5.41$).

It turns out the absolute value $\phi = .30$ computed for Examples 16.1/16.2 is almost identical
to C = .29, the value of the contingency coefficient computed for the same set of data. As a general
rule, although the two values will not be identical, the absolute value of phi and the contingency
coefficient will be close to one another.
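
The two routes to phi can be compared with a few lines of Python. This is an illustrative sketch: the function name is an assumption, and the cell labels a, b, c, d follow the 2 x 2 notation used in the text (a = 30, b = 70, c = 60, d = 40 for Examples 16.1/16.2).

```python
import math

def phi_coefficient(a, b, c, d):
    """Signed phi from the cell frequencies: (ad - bc) / sqrt of the product of the marginal sums."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

phi_signed = phi_coefficient(30, 70, 60, 40)
phi_half_sample = phi_coefficient(15, 35, 30, 20)   # same proportions, half the sample size
phi_from_chi_square = math.sqrt(18.18 / 200)        # absolute value only

print(round(phi_signed, 2), round(phi_half_sample, 2), round(phi_from_chi_square, 2))
```

Only the cell-frequency formula retains the sign; the chi-square route always returns the absolute value.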

Cohen (1977; 1988, p. 223) notes that phi is identical in value to the w index discussed in
the previous section. Thus, Cohen (1977, 1988) has proposed the following (admittedly arbitrary)
phi values as criteria for identifying the magnitude of an effect size (which are identical to the
criteria stated earlier for the w index): small effect size: $.10 < \phi \le .30$; medium effect size:
$.30 < \phi \le .50$; large effect size: $\phi > .50$. Employing Cohen's (1977, 1988) guidelines, the
observed effect size for Examples 16.1/16.2 ($\phi = .30$) is at the lower limit of a medium effect size
(i.e., there is a moderate relationship between the two variables employed in the study).

The use of the phi coefficient is most commonly endorsed in the case of 2 x 2 contingency 
tables involving two variables that are dichotomous in nature. Because of this, among others, 
Guilford (1965) and Fleiss (1981) note that one of the most useful applications of phi is for 
determining the intercorrelation between the responses of subjects on two dichotomous test 
items. Siegel and Castellan (1988) note that when the two variables being correlated are
ordered variables, computation of the phi coefficient sacrifices information, and because of this 
under such conditions it is preferable to employ alternative measures of association that are 
designed for ordered tables (such as Goodman and Kruskal’s gamma which is discussed later 
in the book). For a more thorough overview of the phi coefficient, the reader should consult 
Guilford (1965) and Fleiss (1981), who, among other things, discuss various sources who argue 
in favor of employing measures other than phi with 2 x 2 tables. 


Test 16h: Cramér's phi coefficient  Developed by Cramér (1946), Cramér's phi coefficient
(which will be represented by the notation $\phi_C$) is an extension of the phi coefficient to
contingency tables that are larger than 2 x 2 tables. Cramér's phi coefficient, which can assume
a value between 0 and +1, is computed with Equation 16.19.

$$\phi_C = \sqrt{\frac{\chi^2}{n(k - 1)}}$$   (Equation 16.19)

Where: k represents the smaller of the two values of r and c in the contingency table

The derivation of Cramér's phi coefficient is based on the fact that the maximum value
chi-square can attain for a set of data is $\chi^2_{max} = n(k - 1)$. Thus, the value $\phi_C$ is the square root
of a proportion that represents the computed value of chi-square divided by the maximum
possible chi-square value for a set of data. When the computed chi-square value for a set of data
equals $\chi^2_{max}$, the value $\phi_C = 1$ will be obtained, which indicates maximum dependency between
the two variables.

Since for 2 x 2 tables Cramér's phi and the phi coefficient are equivalent (i.e., when
k = 2, $\phi_C = \phi = \sqrt{\chi^2/n}$), they will both yield the absolute value of .30 for Examples 16.1/16.2.
Cramér's phi is computed below for the 4 x 3 contingency table presented in Example 16.5.
Since c = 3 is less than r = 4, k = c = 3.

$$\phi_C = \sqrt{\frac{59.16}{(280)(3 - 1)}} = .325$$

The computed value $\phi_C = .325$ is significant at both the .05 and .01 levels, since the
computed value $\chi^2 = 59.16$ is significant at the aforementioned levels of significance. Thus, one
can conclude that in the underlying population, the value of Cramér's phi coefficient is some
value other than zero.

Cohen (1977; 1988, p. 223) states that the value computed for Cramér's phi coefficient
can be converted into the previously discussed w index through use of the following equation:
$w = \phi_C \sqrt{k - 1}$. When the computed value $\phi_C = .325$ is substituted in the latter equation,
the resulting value is $w = (.325)\sqrt{3 - 1} = .46$. Employing Cohen's (1977, 1988) criteria for
the w index, the value w = .46 falls within the upper region of the range of values listed for a
medium effect size.
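
A brief Python sketch of Equation 16.19 and the conversion to w is given below. The function name is an assumption made for illustration; the inputs ($\chi^2 = 59.16$, n = 280, a 4 x 3 table) are the values quoted in the text for Example 16.5.

```python
import math

def cramers_phi(chi_sq, n, r, c):
    """Equation 16.19: sqrt(chi-square / (n(k - 1))), where k is the smaller of r and c."""
    k = min(r, c)
    return math.sqrt(chi_sq / (n * (k - 1)))

phi_c = cramers_phi(59.16, 280, 4, 3)         # 4 x 3 table of Example 16.5
w_index = phi_c * math.sqrt(min(4, 3) - 1)    # w = phi_C * sqrt(k - 1)

# For a 2 x 2 table k - 1 = 1, so Cramér's phi reduces to the absolute value of phi
phi_c_2x2 = cramers_phi(18.18, 200, 2, 2)

print(round(phi_c, 3), round(w_index, 2), round(phi_c_2x2, 2))
```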

Daniel (1990) notes that when a contingency table is square (i.e., r = c) and $\phi_C = 1$, there
is a perfect correlation between the two variables (which will be reflected by the fact that all of
the observations will be in the cells of one of the diagonals of the table). When $r \ne c$ and
$\phi_C = 1$, however, the two variables will not be perfectly correlated in the same manner as is the
case with a square contingency table. Conover (1980, 1999) notes that although under all
conditions the possible range of values for $\phi_C$ will be between 0 and +1, its interpretation will
depend on the values of r and c. He states that there is a tendency for the value of $\chi^2$ (and
consequently the value of $\phi_C$) to increase as the values of r and c become larger. For this reason
Conover (1980, 1999) suggests that $\phi_C$ may not be completely accurate for comparing the degree
of association in different size tables. Daniel (1990) notes that when r = c = 2, the value of $\phi_C$
is equal to the square of the tie-adjusted value of Kendall's tau (Test 30) (discussed later in the
book) computed for the same set of data. As is the case with the phi coefficient, when ordered
categories are employed for both variables, sources do not recommend employing Cramér's phi
since it sacrifices information. In designs involving variables with ordered categories, it is
preferable to employ an alternative measure of association such as Goodman and Kruskal's
gamma.


Test 16i: Yule's Q  Yule's Q (Yule (1900)) is a measure of association for a 2 x 2 contingency
table. It is presented in this section to illustrate that if two or more measures of association are 
computed for the same set of data, they may not yield comparable values. Since it employs less 
information than the phi coefficient, Yule’s Q is less frequently recommended than phi as a 
measure of association for 2 x 2 tables. Yule's Q is actually a special case of Goodman and 
Kruskal’s gamma (although unlike gamma, which is only used with ordered contingency 
tables, Yule’s Q can be used for both ordered and unordered tables). 

Employing the notation in Table 16.6 for a 2 x 2 contingency table, Equation 16.20 is 
employed to compute the value of Yule's Q. 


$$Q = \frac{ad - bc}{ad + bc}$$   (Equation 16.20)


Sources that discuss Yule's Q generally note that it tends to inflate the degree of association 
in the underlying population. Ott et al. (1992) note that an additional limitation is that if the 
absolute value of Q equals 1, it does not necessarily mean there is a perfect association between 
the two variables. In point of fact, if the observed frequency of any of the four cells in a 2 x 2 
contingency table equals 0, the value of Yule's Q will equal either -1 or +1. For this reason, the
meaning of Q can be quite misleading in cases where the frequency of one of the cells is equal 
to 0. Because of the latter, it is not recommended that Yule's Q be employed when there is a 
small number of frequencies in any of the four cells of a 2 x 2 contingency table. 

Employing Equation 16.20, the value Q = -.56 is computed for Examples 16.1/16.2 (as
well as for the data in Table 16.22).

$$Q = \frac{(30)(40) - (70)(60)}{(30)(40) + (70)(60)} = -.56$$
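
Equation 16.20 and the zero-cell limitation noted above can both be illustrated with a short Python sketch. The function name is an assumption, and the second table (10, 0, 5, 20) is a hypothetical set of frequencies invented purely to show the behavior of Q when one cell is empty.

```python
def yules_q(a, b, c, d):
    """Equation 16.20: Q = (ad - bc) / (ad + bc)."""
    return (a * d - b * c) / (a * d + b * c)

q_example = yules_q(30, 70, 60, 40)      # Examples 16.1/16.2
q_half_sample = yules_q(15, 35, 30, 20)  # same proportions, half the sample size

# Hypothetical table with one empty cell: Q reaches 1 even though
# cell c is not empty, i.e., the association is not perfect
q_zero_cell = yules_q(10, 0, 5, 20)

print(round(q_example, 2), round(q_half_sample, 2), q_zero_cell)
```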


Note that although both the values of Q and phi suggest the presence of a negative
association between the two variables, the absolute value Q = .56 is almost twice the value $\phi = .30$
computed previously for the same set of data. When Cohen's (1977, 1988) criteria are applied
to the computed value of phi it indicates the presence of a medium effect size, yet if the same 
criteria are applied to Q = .56, it suggests the presence of a large effect size. (It should be noted, 
however, that Cohen does not endorse the use of the values he lists for phi with Yule's Q.) 

Ott et al. (1992) note that the significance of Yule's Q can be evaluated with Equation
16.21.

$$z = \frac{Q}{\sqrt{\frac{1}{4}(1 - Q^2)^2\left(\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}\right)}}$$   (Equation 16.21)

Employing Equation 16.21 with the data for Examples 16.1/16.2, the value z = -5.46 is
computed.

$$z = \frac{-.56}{\sqrt{\frac{1}{4}[1 - (-.56)^2]^2\left(\frac{1}{30} + \frac{1}{70} + \frac{1}{60} + \frac{1}{40}\right)}} = -5.46$$

The obtained value z = -5.46 is evaluated with Table A1. Since the obtained absolute
value z = 5.46 is greater than the tabled critical two-tailed .05 and .01 values $z_{.05} = 1.96$ and
$z_{.01} = 2.58$, the nondirectional alternative hypothesis is supported at both the .05 and .01 levels.
Since z = 5.46 is greater than the tabled critical one-tailed .05 and .01 values $z_{.05} = 1.65$ and
$z_{.01} = 2.33$, the directional alternative hypothesis that is consistent with the data is also supported
at both the .05 and .01 levels.
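
The z statistic for Yule's Q can be reproduced with the following Python sketch, which assumes the standard-error expression $\sqrt{\frac{1}{4}(1 - Q^2)^2(1/a + 1/b + 1/c + 1/d)}$ in the denominator of Equation 16.21. The function name is illustrative. Substituting the rounded value Q = -.56 reproduces the hand computation; carrying the unrounded Q gives a slightly smaller absolute value (about 5.38), which leads to the same decision.

```python
import math

def yules_q_z(q, a, b, c, d):
    """z = Q / sqrt((1/4)(1 - Q^2)^2 (1/a + 1/b + 1/c + 1/d))."""
    standard_error = math.sqrt(0.25 * (1 - q ** 2) ** 2 * (1 / a + 1 / b + 1 / c + 1 / d))
    return q / standard_error

# Hand computation in the text uses the rounded value Q = -.56
z_rounded_q = yules_q_z(-0.56, 30, 70, 60, 40)

# Same computation with the unrounded value of Q
q_exact = (30 * 40 - 70 * 60) / (30 * 40 + 70 * 60)
z_exact_q = yules_q_z(q_exact, 30, 70, 60, 40)

significant_two_tailed_01 = abs(z_exact_q) > 2.58

print(round(z_rounded_q, 2), round(z_exact_q, 2), significant_two_tailed_01)
```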


Test 16j: The odds ratio (and the concept of relative risk)  As is the case with Yule's Q (but
unlike C, $\phi$, and $\phi_C$), the odds ratio (which is attributed to Cornfield (1951)) is a measure of
association employed with contingency tables that is not a function of chi-square. Although the
odds ratio, which will be represented by the notation o, can be applied to any size contingency
table, it is easiest to interpret in the case of 2 x 2 tables. The odds ratio expresses the degree of 
association between the two variables in a different numerical format than all of the previously 
discussed measures of association. In some respects it provides a more straightforward way of 
interpreting the results of a contingency table than the correlational measures of association 
discussed previously. 

The odds ratio is one of two measures that are commonly employed in epidemiological 
research to indicate the risk of a person contracting a disease. Although relative risk, which is 
the second of the two measures, is intuitively easier to understand, the odds ratio is more useful 
when employed within the context of statistical analysis. In this section both of the aforemen- 
tioned measures will be examined. In order to do this, consider the data in Table 16.23. We will 
assume that the latter table represents the results of a hypothetical study which evaluates if 
someone who washes her hands immediately after handling a diseased animal is less likely to 
contract the disease than someone who does not wash her hands. 

Note that Table 16.23 is identical to Table 16.2, except for the following: a) In the case of
the row variable, the noise condition has been replaced by a washes hands condition, and the no
noise condition has been replaced by a does not wash condition; b) In the case of the column
variable, the helped the confederate response category has been replaced by the contracts the
disease category, and the did not help the confederate response category has been replaced by
the does not contract the disease category. Thus, the independent variable in the study
summarized in Table 16.23 is whether or not a person washes her hands, and the dependent
variable is whether or not the person contracts the disease.


Table 16.23. Summary of Hand Washing Study 


                   Contracts       Does not contract    Row sums
                   the disease     the disease
Washes hands           30                70                100
Does not wash          60                40                100
Column sums            90               110                200 (Total observations)


Relative risk allows a researcher to compare the relative probabilities of contracting a 
disease. Specifically, relative risk is the probability of contracting a disease if you are a member 
of one group (typically the group that is considered to have the higher risk) divided by the prob- 
ability of contracting the disease if you are a member of the other group (typically the group that 
is considered to have the lower risk). In the case of our example, the probability of contracting 
the disease in the does not wash group (which will be considered the high risk group) is 
60/100 = .6. The latter probability is simply the number of people in the does not wash group 
who contract the disease (60) divided by the total number of people who constitute the does 
not wash group (100). In the same respect, the probability of contracting the disease in the 
washes hands group (which will be considered the low risk group) is 30/100 = .3. The latter 
probability is simply the number of people in the washes hands group who contract the disease 
(30) divided by the total number of people who constitute the washes hands group (100). 
Through use of the notation in Table 16.6, the above can be expressed as follows. 


p(Contracts disease/Does not wash) = c/(c + d)
p(Contracts disease/Washes hands) = a/(a + b)


The relative risk (RR) is computed with Equation 16.22.

$$RR = \frac{c/(c + d)}{a/(a + b)} = \frac{60/(60 + 40)}{30/(30 + 70)} = 2$$   (Equation 16.22)

or

$$RR = \frac{ac + bc}{ac + ad} = \frac{(30)(60) + (70)(60)}{(30)(60) + (30)(40)} = 2$$





The value RR = 2 computed for relative risk means that a person who does not wash her
hands is 2 times as likely to contract the disease as a person who washes her hands. If the
numerator and denominator of Equation 16.22 are reversed (which computes the relative risk for
someone who washes relative to someone who does not wash), the value RR = .5 is computed.
The value RR = .5 indicates that a person who washes her hands has half (.5) the likelihood of
contracting the disease of a person who does not wash her hands.

If we apply the computations for relative risk to the variables employed in Example 16.1,
we can state that someone in the no noise condition is 2 times as likely to help the
confederate as someone in the noise condition.
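
The relative risk computation can be expressed in a few lines of Python. The sketch below uses exact fractions to avoid rounding; the function name is an assumption, and the cell labels follow the hand washing table (a = 30, b = 70 for the washes hands row; c = 60, d = 40 for the does not wash row).

```python
from fractions import Fraction

def relative_risk(a, b, c, d):
    """Equation 16.22: risk in the row (c, d) divided by risk in the row (a, b)."""
    risk_high = Fraction(c, c + d)   # p(Contracts disease/Does not wash)
    risk_low = Fraction(a, a + b)    # p(Contracts disease/Washes hands)
    return risk_high / risk_low

rr = relative_risk(30, 70, 60, 40)
rr_reversed = 1 / rr                 # washes hands relative to does not wash

# Equivalent cross-product form: (ac + bc) / (ac + ad)
rr_cross_product = Fraction(30 * 60 + 70 * 60, 30 * 60 + 30 * 40)

print(rr, rr_reversed, rr_cross_product)
```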

At this point we will turn our attention to the odds ratio. Before considering the latter 
measure it will be useful to clarify the concept of odds. The odds that an event (to be designated 
X) will occur are computed by dividing the probability that the event will occur by the probability 
that the event will not occur. The latter is described by Equation 16.23. 


$$\text{Odds}(X) = \frac{p(X \text{ will occur})}{p(X \text{ will not occur})}$$   (Equation 16.23)


The following should be noted with respect to odds: a) The computed value for odds can 
fall anywhere in the range 0 to infinity; b) When the value computed for odds is greater than 1, 
it indicates that the probability of an event occurring is better than one-half (1/2). Thus, the 
larger the odds of an event occurring, the higher the probability that the event will occur; c) 
When the value computed for odds is less than 1, it indicates that the probability of an event 
occurring is less than one-half (1/2). Thus, the smaller the odds of an event occurring, the lower 
the probability that the event will occur. The lowest value that can be computed for odds is 0, 
which indicates that the probability the event will occur is 0; d) When the value computed for 
odds equals 1, it indicates that the probability of the event occurring is one-half (1/2) (i.e., there
is a 50-50 chance of the event occurring); and e) More often than not, when odds are published
in the media they are the odds that an event will not occur rather than the odds that the event will 
occur. As an example, if the bookmakers’ odds of a horse winning the Kentucky Derby are 4 
to 1, it indicates that the bookmakers are saying there is only a .2 probability the horse will win 
the derby, and a .8 probability the horse will not win. Consequently, the odds of the horse not 
winning the derby are computed as follows: p(Will not win)/p(Will win) = .8/.2 = 4, which is 
generally expressed as odds of 4 to 1 (often written as 4:1). If in the opinion of the bookmakers 
the likelihood that a horse will win the derby is 2 out of 5, then the odds are computed by 
dividing the likelihood the horse won't win (3/5) by the likelihood the horse will win (2/5), 
which is 3/2 (or 1.5). The latter odds are expressed as 1.5 to 1 (or 1.5:1) or more commonly as 
3 to 2 (or 3:2). 
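
The relationship between a probability and its odds can be sketched in a few lines of Python. This is only an illustration of Equation 16.23; the function names `odds` and `probability_from_odds` are hypothetical and do not come from the Handbook.

```python
def odds(p):
    """Odds that an event occurs, given its probability p (Equation 16.23)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if p == 1.0:
        return float("inf")  # a certain event has infinitely large odds
    return p / (1.0 - p)


def probability_from_odds(o):
    """Invert Equation 16.23: recover the probability from the odds."""
    return o / (1.0 + o)


# Bookmaker example from the text: p(win) = .2, so the odds *against*
# winning are p(will not win)/p(will win) = .8/.2.
odds_against_first_horse = odds(1 - 0.2)

# Second example from the text: p(win) = 2/5, so the odds against are (3/5)/(2/5).
odds_against_second_horse = odds(1 - 2 / 5)

print(round(odds_against_first_horse, 4), round(odds_against_second_horse, 4))
```

Note that the function is applied to the probability of not winning, since bookmakers' odds are generally the odds that an event will not occur.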

Odds will now be computed for the data in Tables 16.2/16.23. 


\[ \text{Odds(Help/No noise)} = \frac{p(\text{Help/No noise})}{p(\text{Not help/No noise})} = \frac{60/100}{40/100} = 1.5 \]

\[ \text{Odds(Help/Noise)} = \frac{p(\text{Help/Noise})}{p(\text{Not help/Noise})} = \frac{30/100}{70/100} = .429 \]

\[ \text{Odds(Contracts disease/Does not wash)} = \frac{p(\text{Contracts disease/Does not wash})}{p(\text{Does not contract disease/Does not wash})} = 1.5 \]

\[ \text{Odds(Contracts disease/Washes hands)} = \frac{p(\text{Contracts disease/Washes hands})}{p(\text{Does not contract disease/Washes hands})} = .429 \]

The above results indicate that the odds of helping in the no noise condition are 1.5 to 1. 
Since the value 1.5 is greater than 1, it indicates the probability of a subject helping in the no 
noise condition is greater than 1/2. The odds of helping in the noise condition are .429 to 1. 
Since the value .429 is less than 1, it indicates the probability of a subject helping in the noise 
condition is less than 1/2. In the same respect the odds of contracting the disease in the does 
not wash condition are 1.5 to 1, while the odds of contracting the disease in the washes hands 
condition are .429 to 1. 

The odds ratio is simply the ratio between the two odds computed for a contingency table. 
Typically the larger number is divided by the smaller number. Thus o = 1.5/.429 = 3.5. In the 
case of Example 16.1, the latter value indicates that the odds of helping in the no noise condition 
are 3.5 times larger than the odds of helping in the noise condition. In the case of the data in 
Table 16.23, the value o = 3.5 indicates that the odds of contracting the disease in the does not 
wash condition are 3.5 times larger than the odds of contracting the disease in the washes 
hands condition. 

It can be algebraically demonstrated that Equation 16.24 (which employs the notation in 
Table 16.6) provides a simple method for computing the odds ratio for a 2 x 2 contingency table. 
The latter equation is employed below to obtain the value o = 3.5 for our data. 


\[ o = \frac{bc}{ad} = \frac{(70)(60)}{(30)(40)} = 3.5 \qquad \text{(Equation 16.24)} \]
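
The two routes to the odds ratio (dividing one odds by the other, and the cross-product form of Equation 16.24) can be checked against each other with a short sketch. The function names are hypothetical, and the cell layout (a and b for the noise row, c and d for the no noise row, with helped listed first) is inferred from the values substituted in Equation 16.24 rather than taken from Table 16.6 itself.

```python
def odds_ratio_from_odds(odds_1, odds_2):
    """Ratio of two odds, dividing the larger value by the smaller one."""
    return max(odds_1, odds_2) / min(odds_1, odds_2)


def odds_ratio_2x2(a, b, c, d):
    """Cross-product form of the odds ratio (Equation 16.24): o = bc/ad."""
    return (b * c) / (a * d)


# Assumed cell layout for Example 16.1:
#   a = 30 (noise, helped)      b = 70 (noise, did not help)
#   c = 60 (no noise, helped)   d = 40 (no noise, did not help)
a, b, c, d = 30, 70, 60, 40

odds_noise = (a / (a + b)) / (b / (a + b))       # (30/100)/(70/100)
odds_no_noise = (c / (c + d)) / (d / (c + d))    # (60/100)/(40/100)

print(round(odds_no_noise, 3), round(odds_noise, 3))
print(round(odds_ratio_from_odds(odds_no_noise, odds_noise), 3))
print(round(odds_ratio_2x2(a, b, c, d), 3))
```

Both routes should agree apart from floating-point rounding, since (c/d)/(a/b) is algebraically identical to bc/ad.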


Various sources (e.g., Pagano and Gauvreau (1993)) note that when the probability of an 
event occurring is very low, the values in cells a and c of Tables 16.6/16.23 will be very small. 
When the latter is true, the values computed for the relative risk and odds ratio will be very 
close together, since as the values of a and c approach 0 the product ac becomes negligible, and 
the equation RR = (ac + bc)/(ac + ad) for computing relative risk reduces to Equation 16.24 
(i.e., bc/ad). Although the latter is not true for the example under discussion, a case where the 
relative risk and odds ratio are, in fact, almost identical is described in Section IX (the 
Appendix) of the Pearson product-moment correlation coefficient under the discussion of 
meta-analysis and related topics. 
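
The convergence of the two measures for a rare event can be illustrated numerically. In the sketch below the function names are hypothetical, and the "rare event" counts are invented purely for illustration (they are not data from the Handbook); the cell layout is the one assumed for Equation 16.24.

```python
def relative_risk(a, b, c, d):
    """Relative risk of row 2 relative to row 1: RR = (ac + bc)/(ac + ad)."""
    return (a * c + b * c) / (a * c + a * d)


def odds_ratio(a, b, c, d):
    """Odds ratio via Equation 16.24: o = bc/ad."""
    return (b * c) / (a * d)


# Example 16.1 (event is common): the two measures differ noticeably.
common = (30, 70, 60, 40)
print(relative_risk(*common), odds_ratio(*common))

# Hypothetical rare event: 1 case per 1000 in row 1, 2 cases per 1000 in row 2.
# Cells a and c are tiny, so ac is negligible next to bc and ad.
rare = (1, 999, 2, 998)
print(relative_risk(*rare), round(odds_ratio(*rare), 4))
```

With the common event the relative risk and odds ratio are far apart, whereas with the hypothetical rare event they differ only in the third decimal place.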

Although, as noted earlier, the odds ratio can be extended beyond 2 x 2 tables, it becomes 
more difficult to interpret with larger contingency tables. However, in instances where there are 
more than two rows but only two columns, its interpretation is still relatively straightforward. 
To illustrate, assume that in Example 16.1 we have three noise conditions instead of two — spe- 
cifically, loud noise, moderate noise, and no noise. Table 16.24 depicts a hypothetical set of 
data that summarizes the results of such a study. 


Table 16.24  Summary of Data for a 3 x 2 Contingency Table 

                     Helped             Did not help 
                     the confederate    the confederate    Row sums 
Loud noise                 30                 70              100 
Moderate noise             50                 50              100 
No noise                   80                 20              100 
Column sums               160                140              300 (Total observations) 


Within any of the three noise conditions, we can determine the odds that someone in that 
condition will help the confederate. This is accomplished by dividing the proportion of subjects 
in a given condition who helped the confederate by the proportion of subjects in the condition 
who did not help the confederate. Thus, for the loud noise condition the odds that someone will 
help the confederate are (30/100)/(70/100) = .43. For the moderate noise condition the odds 
that someone will help the confederate are (50/100)/(50/100) = 1. For the no noise condition 
the odds that someone will help the confederate are (80/100)/(20/100) = 4. From these values 
we can compute the following three odds ratios: a) The odds ratio of someone in the no noise 
condition helping the confederate versus someone in the loud noise condition: o = 4/.43 = 9.3; 
b) The odds ratio of someone in the no noise condition helping the confederate versus 
someone in the moderate noise condition: o = 4/1 = 4; and c) The odds ratio of someone in the 
moderate noise condition helping the confederate versus someone in the loud noise condition: 
o = 1/.43 = 2.33. 

Thus, the odds of someone in the no noise condition helping the confederate are 9.3 times 
larger than the odds of someone in the loud noise condition helping the confederate, and 4 
times larger than the odds of someone in the moderate noise condition helping the confederate. 
The odds of someone in the moderate noise condition helping the confederate are 2.33 times 
larger than the odds of someone in the loud noise condition helping the confederate. 
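
The pairwise comparisons for Table 16.24 can be generated mechanically, as in the following sketch (the variable and function names are hypothetical). Because the code does not round the loud noise odds to .43 before dividing, its ratios may differ slightly in the last reported digit from the hand-computed values.

```python
from itertools import combinations

# Table 16.24: (helped, did not help) for each noise condition
table = {
    "loud noise": (30, 70),
    "moderate noise": (50, 50),
    "no noise": (80, 20),
}


def row_odds(helped, did_not_help):
    """Odds of helping within one row; equals (helped/n)/(did_not_help/n)."""
    return helped / did_not_help


odds_by_condition = {name: row_odds(*counts) for name, counts in table.items()}

# For every pair of conditions, divide the larger odds by the smaller odds.
pairwise_ratios = {}
for first, second in combinations(table, 2):
    larger, smaller = sorted(
        (first, second), key=odds_by_condition.get, reverse=True
    )
    pairwise_ratios[(larger, smaller)] = (
        odds_by_condition[larger] / odds_by_condition[smaller]
    )

for (larger, smaller), ratio in pairwise_ratios.items():
    print(f"{larger} vs. {smaller}: o = {ratio:.2f}")
```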


Test 16j-a: Test of significance for an odds ratio and computation of a confidence interval 
for an odds ratio  Christensen (1990) and Pagano and Gauvreau (1993) note that Equation 
16.25 can be employed to evaluate the null hypothesis that the true value of the odds ratio in the 
underlying population is equal to 1 (i.e., that the odds of the event occurring are the same in the 
two conditions). Since the sampling distribution of the odds ratio is 
positively skewed, a logarithmic scale transformation is employed in computing the test statistic, 
which is a standard normal deviate (i.e., a z score). (Logarithmic scale transformations are 
discussed in Section VII of the t test for two independent samples.) Without the logarithmic 
transformation, the numerator of Equation 16.25 would be (o - 1), where o is the computed value 
of the odds ratio, and 1 is the expected value of the odds ratio if the odds of the event occurring 
are equal in the two conditions. By virtue of the logarithmic transformation, the 
numerator of Equation 16.25 becomes the natural logarithm (which is defined in Endnote 5 in 
the Introduction) of the odds ratio minus 0, which is the natural logarithm of 1. 


\[ z = \frac{\ln(o) - 0}{SE} \qquad \text{(Equation 16.25)} \]


Where: ln(o) represents the natural logarithm of the computed value of the odds ratio 
       SE represents the standard error, which is the estimated standard deviation of the 
       sampling distribution. The standard error is computed as follows: 

\[ SE = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \]








If we evaluate the null hypothesis regarding the odds ratio in reference to the data for 
Example 16.1, the following values are computed: a) Through use of the appropriate tables or 
a calculator (using the ln key), the natural logarithm of the odds ratio (o = 3.5) is determined to 
be 1.2528; and b) The value of the standard error is computed to be SE = .2988. 


\[ SE = \sqrt{\frac{1}{30} + \frac{1}{70} + \frac{1}{60} + \frac{1}{40}} = .2988 \]


Substituting the values ln(o) = 1.2528 and SE = .2988 in Equation 16.25, the value 
z = 4.19 is computed. 
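
The computation of the test statistic can be sketched in Python as follows. The function name is hypothetical, the cell layout is the one assumed for Equation 16.24, and the two-tailed probability returned alongside z is an addition for convenience (the Handbook evaluates z against tabled critical values instead).

```python
import math
from statistics import NormalDist


def odds_ratio_z_test(a, b, c, d):
    """Return (odds ratio, standard error, z, two-tailed p) for a 2 x 2 table.

    z follows Equation 16.25: (ln(o) - 0)/SE, with SE = sqrt(1/a + 1/b + 1/c + 1/d).
    """
    o = (b * c) / (a * d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = (math.log(o) - 0) / se
    # Two-tailed probability under the standard normal distribution
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return o, se, z, p_two_tailed


o, se, z, p = odds_ratio_z_test(30, 70, 60, 40)
print(f"o = {o:.1f}, ln(o) = {math.log(o):.4f}, SE = {se:.4f}, z = {z:.2f}")
```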
The value z = 4.19 is evaluated with Table A1 in the Appendix. In Table A1 the 
tabled critical two-tailed .05 and .01 values are z_.05 = 1.96 and z_.01 = 2.58, and the tabled 
critical one-tailed .05 and .01 values are z_.05 = 1.65 and z_.01 = 2.33. If a nondirectional 
alternative hypothesis is employed, in order to reject the null hypothesis the absolute value of z 
must be equal to or greater than the tabled critical two-tailed value at the prespecified level of 
significance. If a directional alternative hypothesis is employed, in order to reject the null 
hypothesis the following must be true: a) The absolute value of z must be equal to or greater than 
the tabled critical one-tailed value at the prespecified level of significance; and b) If the 
alternative hypothesis stipulates that the population odds ratio is less than 1, the sign of the 
computed z value must be negative (since a positive number less than 1 will yield a negative 
value for a logarithm). If the alternative hypothesis stipulates that the underlying population 
odds ratio is greater than 1, the sign of the computed z value must be positive (since a number 
greater than 1 will yield a positive value for a logarithm). Since the computed value z = 4.19 is 
greater than the two-tailed .05 and .01 values z_.05 = 1.96 and z_.01 = 2.58, the nondirectional 
alternative hypothesis is supported at both the .05 and .01 levels. Since z = 4.19 is a positive 
number that is greater than the tabled critical one-tailed .05 and .01 values z_.05 = 1.65 and 
z_.01 = 2.33, the directional alternative hypothesis stipulating that the underlying population odds 
ratio is greater than 1 is supported at both the .05 and .01 levels. Thus, we can conclude that the 
population odds ratio is not equal to 1. 

Christensen (1990) and Pagano and Gauvreau (1993) describe the computation of a 
confidence interval for the odds ratio. The latter computation initially requires that a product based 
on multiplying the standard error by the relevant z value for the confidence interval be obtained. 
Thus, if one is computing the 95% confidence interval, the standard error is multiplied by the 
tabled critical two-tailed .05 value z_.05 = 1.96. If one is computing the 99% confidence interval, 
the standard error is multiplied by the tabled critical two-tailed .01 value z_.01 = 2.58. The latter 
product for the desired confidence interval is then added to and subtracted from the natural 
logarithm of the odds ratio. The antilogarithms (which are the original numbers for which the 
logarithms were computed) of the resulting two values are determined. The latter values define 
the limits of the confidence interval. 

To illustrate, the 95% confidence interval will be computed. We first multiply the standard 
error SE = .2988 by the tabled critical two-tailed .05 value z_.05 = 1.96, and obtain (.2988)(1.96) 
= .5856. The latter value is added to and subtracted from the natural logarithm of the odds ratio, 
which was previously computed to be 1.2528. Thus 1.2528 ± .5856 yields the two values .6672 
and 1.8384. Through use of the appropriate tables or a calculator (using the e^x key), the 
antilogarithms of the latter two values are determined to be 1.9488 and 6.2865. These values 
represent the limits that define the 95% confidence interval for the odds ratio. In other words, 
we can be 95% confident (or the probability is .95) that the true value of the odds ratio in the 
population falls between 1.9488 and 6.2865. 
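
The steps of the confidence interval computation can be sketched as follows. The function name and the `confidence` parameter are hypothetical, and the cell layout is the one assumed for Equation 16.24. The sketch obtains the critical z value from the standard normal distribution rather than rounding it to 1.96 or 2.58, so its limits may differ from the hand-computed ones in the third or fourth decimal place.

```python
import math
from statistics import NormalDist


def odds_ratio_confidence_interval(a, b, c, d, confidence=0.95):
    """Confidence limits for the odds ratio of a 2 x 2 table.

    The interval is built on the logarithmic scale, ln(o) +/- (z)(SE),
    and the two limits are then converted back with the antilogarithm.
    """
    o = (b * c) / (a * d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Two-tailed critical value, e.g. approximately 1.96 for 95%
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_critical * se
    log_o = math.log(o)
    return math.exp(log_o - margin), math.exp(log_o + margin)


lower, upper = odds_ratio_confidence_interval(30, 70, 60, 40)
print(f"95% confidence interval: {lower:.4f} to {upper:.4f}")
```

Since the interval for Example 16.1 does not contain the value 1, it agrees with the outcome of the significance test for the odds ratio.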


VII. Additional Discussion of the Chi-Square Test for r x c Tables 


1. Simpson's paradox  Simpson's paradox occurs when either the direction or the magnitude of 
the relationship between two variables (to be designated X and Y) is influenced by a third variable 
(to be designated Z). When the relationship between X and Y is summarized in a two-dimensional 
contingency table, the direction and/or the magnitude of the relationship between the two 
variables is different from what it appears to be when the data are summarized in a 
three-dimensional contingency table which takes into account all three variables. 

To illustrate, let us assume that a study is conducted which evaluates the efficacy of two 
surgical treatments (to be designated Treatment A and Treatment B) for a seizure disorder. The 
study is conducted at two hospitals to be designated as Hospital 1 and Hospital 2. The neuro- 
surgeons who perform the surgery in Hospital 1 are extremely experienced in using both of the 
surgical techniques. The neurosurgeons at Hospital 2, on the other hand, are relatively inex- 
perienced in using the surgical procedures. However, the researcher who designs the study is 
unaware of differences in experience between the surgeons at the two hospitals. Consequently, 
the researcher conceptualizes the study as having one independent variable, which is the type of 
surgical treatment a patient receives. The dependent variable (which constitutes the second 
variable in the study) is the measure of the efficacy of the treatments — specifically, the cate- 
gorization of a patient as a success or a failure. For purposes of illustration we will assume that 
550 patients participate in the study, and that half of the patients receive Treatment A and the 
other half Treatment B. Of the 550 patients, 370 are treated at Hospital 1, and 180 are treated 
at Hospital 2. Table 16.25 summarizes the results of the study in the form of a 2 x 2 contingency 
table. 


Table 16.25  Data for Neurosurgery Study in 2 x 2 Contingency Table 

                             Response to treatment 
                           Success    Failure    Totals 
Treatment   Treatment A      160        115        275 
            Treatment B      160        115        275 
Totals                       320        230        550 


It should be apparent from inspection of Table 16.25 that the two treatments yield the 
identical number of successes and failures, and thus there is no difference in the efficacy of the 
two treatments. The latter can be confirmed by the fact that if a chi-square test for 
homogeneity is employed to evaluate the data (without using the correction for continuity), it 
yields the value χ² = 0 (since the expected and observed frequency for each cell will be equal). 
We can also make the following additional statements about the data: a) A patient has a 160/275 
= .58 probability of responding favorably to Treatment A, and a 160/275 = .58 probability of 
responding favorably to Treatment B; b) If the odds ratio is computed (through use of Equation 
16.24), the value o = [(115)(160)]/[(160)(115)] = 1 is computed. An odds ratio of 1 in reference 
to Table 16.25 indicates that the odds of Treatment A being successful are equal to the odds of 
Treatment B being successful. 

Table 16.26 summarizes the results of the study in the format of a three-dimensional con- 
tingency table. Note that in addition to the two variables taken into account in Table 16.25, 
Table 16.26 includes as a third variable the hospital at which a patient received the treatment. 
Inspection of Table 16.26 reveals that if the hospital that administered the treatment is taken into 
account, one reaches a different conclusion regarding the efficacy of the two treatments. 

Table 16.26  Data for Neurosurgery Study in Three-Dimensional Contingency Table 

                             Hospital 1             Hospital 2 
                          Success   Failure      Success   Failure     Totals 
Treatment   Treatment A     150        75           10        40         275 
            Treatment B     120        25           40        90         275 
Totals                      270       100           50       130         550 

Inspection of Table 16.26 reveals the following: a) In Hospital 1 a patient has a 120/145 
= .83 probability of having a successful response to Treatment B, but only a 150/225 = .67 
probability of having a successful response to Treatment A. If the odds ratio is computed for 
the Hospital 1 data, the value o = [(75)(120)]/[(150)(25)] = 2.4 is computed. The latter value 
indicates that the odds of Treatment B being successful are 2.4 times larger than the odds of 
Treatment A being successful. If a chi-square test for homogeneity is employed to evaluate the 
2 x 2 table for Hospital 1, it yields the value χ² = 11.58, which is significant at both the .05 and 
.01 levels (since for df = 1, χ²_.05 = 3.84 and χ²_.01 = 6.63). The values computed in Hospital 
1 clearly indicate that Treatment B is more successful than Treatment A; b) In Hospital 2 a 
patient has a 40/130 = .31 probability of having a successful response to Treatment B, but only 
a 10/50 = .20 probability of having a successful response to Treatment A. If the odds ratio is 
computed for the Hospital 2 data, the value o = [(40)(40)]/[(10)(90)] = 1.78 is computed. The 
latter value indicates that the odds of Treatment B being successful are 1.78 times larger than the 
odds of Treatment A being successful. If a chi-square test for homogeneity is employed to 
evaluate the 2 x 2 table for Hospital 2, it yields the value χ² = 2.09 (which is not significant, 
since χ² = 2.09 is less than χ²_.05 = 3.84). The fact remains, however, that the success rate 
in Hospital 2 for Treatment B is higher than it is for Treatment A. The lack of a significant result 
for the Hospital 2 data may be a function of the relatively small sample size, which limits the 
power of the analysis. Thus, there is a suggestion that Treatment B may also be more successful 
than Treatment A in Hospital 2. 

Analysis of the treatments within each of the two hospitals strongly suggests that Treatment 
B is more successful than Treatment A. Yet when the data are pooled and expressed within the 
format of a 2 x 2 contingency table, there is no apparent difference between the two treatments. 
This is a classic example of Simpson's paradox, in that both the magnitude and direction of the 
relationship between two variables (the type of treatment administered and a patient's response 
to the treatment) are influenced by a third variable (the hospital at which a treatment was 
administered). The higher success rate for both of the treatments in Hospital 1 
reflects the fact that the doctors at Hospital 1 are more skilled in using the 
surgical treatments than the doctors at Hospital 2 (although there is the possibility that the 
lower success rate at Hospital 2 could be due to the fact that a disproportionate number of 
the more difficult cases were treated at that hospital). The superiority of Treatment B over 
Treatment A is masked in Table 16.25, because the latter table inappropriately weights the 
outcomes at the different hospitals. The two factors that are responsible for the inappropriate 
weighting are: a) In Hospital 1 fewer patients received Treatment B than Treatment A, while in 
Hospital 2 a greater number of patients received Treatment B than Treatment A; and b) The 
doctors at Hospital 1 are more experienced (and thus probably more successful in conducting the 
surgery) than the doctors at Hospital 2. The joint influence of these two factors is what is 
responsible for the absence of any apparent differences between the two treatments when the data 
are summarized in a 2 x 2 contingency table (i.e., Table 16.25). Thus, Simpson's paradox 
illustrates that when data based on three variables (typically two independent variables and one 
dependent variable) are collapsed into a 2 x 2 table, it can dramatically distort what really 
occurred in a study. 
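
The contrast between the pooled and the within-hospital analyses can be reproduced with a short sketch. The function names are hypothetical; each table is entered as (a, b, c, d) with Treatment A in the first row and success in the first column, and the chi-square value is the ordinary Pearson statistic without the correction for continuity.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio via Equation 16.24: o = bc/ad."""
    return (b * c) / (a * d)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2 x 2 table (no correction for continuity)."""
    observed = [[a, b], [c, d]]
    row_sums = [a + b, c + d]
    column_sums = [a + c, b + d]
    n = a + b + c + d
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * column_sums[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Table 16.26: (A success, A failure, B success, B failure)
hospital_1 = (150, 75, 120, 25)
hospital_2 = (10, 40, 40, 90)
# Collapsing over hospitals reproduces Table 16.25
pooled = tuple(x + y for x, y in zip(hospital_1, hospital_2))

for label, cells in [("Hospital 1", hospital_1),
                     ("Hospital 2", hospital_2),
                     ("Pooled", pooled)]:
    print(f"{label}: o = {odds_ratio(*cells):.2f}, "
          f"chi-square = {chi_square_2x2(*cells):.2f}")
```

The within-hospital odds ratios both favor Treatment B, while the pooled table yields an odds ratio of 1 and a chi-square value of 0, which is the pattern described above.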


2. Analysis of multidimensional contingency tables A multidimensional contingency table 
is one that contains information on three or more variables (such as Table 16.26). Although the 
chi-square test for r x c tables can be generalized for use with multidimensional tables, two 
commonly employed alternative approaches for evaluating multidimensional tables are the 
log-likelihood ratio (which employs a statistic commonly designated by the notation G), and 
log-linear analysis. As is the case with the tests that have been described which employ the 
chi-square distribution (i.e., the chi-square goodness-of-fit test and the chi-square test for r x c 
tables), both of the aforementioned methods can also be employed for the analysis of one- and 
two-dimensional tables. When the different methods for evaluating contingency tables are 
employed with the same set of data, they generally yield similar results. 

The analysis of multidimensional contingency tables shares a number of things in common 
with the analysis of factorial designs with an analysis of variance. (Factorial designs are 
discussed under the between-subjects factorial analysis of variance (Test 27).) Both of the 
aforementioned analyses can be employed to evaluate data where there are two or more 
independent variables and a dependent variable. In designs in which there are multiple independent 
variables, within the framework of conducting an omnibus test on the complete body of data, 
it is necessary to determine whether there are any interactions present. An interaction is present 
in a set of data when the performance of subjects on one independent variable is not consistent 
across all the levels of another independent variable. In the case of multidimensional 
contingency tables, the concept of interaction can also be generalized to designs where no clear 
distinction is made with respect to whether variables are independent or dependent variables. 
Although the concept of interaction will be discussed in this section, a more thorough discussion 
of it can be found in Section V of the between-subjects factorial analysis of variance. 

The analysis of multidimensional contingency tables is a complex topic, and a full dis- 
cussion of it is beyond the scope of this book. In this section I will limit my discussion of the 
topic to the generalization of the chi-square test for r x c tables to multidimensional tables. 
Specifically, the latter test will be generalized in order to evaluate a three-dimensional con- 
tingency table. The reader should keep in mind that the multidimensional contingency table to 
be evaluated in this section will represent an example of the simplest multidimensional table that 
can be constructed — a table with three variables, with each variable being comprised of two 
levels. As the number of variables (as well as the number of levels/categories per variable) 
increases, the analysis of a multidimensional table becomes increasingly complex. In addition, 
the more variables that are included in a study, the more difficult it becomes to obtain a clear 
interpretation of the results of any analyses that are conducted. 

As noted earlier, it is possible to have a three-dimensional table that summarizes the data 
for two independent variables and a dependent variable. Example 16.6 (which is similar to 
Example 16.1, except for the fact that it has a second independent variable) will result in a 
multidimensional table of this type. It was also noted that it is possible to have a three- 
dimensional table where no clear-cut distinction is made between the independent and dependent 
variables. Example 16.7 (which is similar to Example 16.2, except for the fact that it describes 
a study involving three variables instead of two variables) will result in a multidimensional table 
of this type. Examples 16.6 and 16.7 will be employed to illustrate the generalization of the chi- 
square test for r x c tables to a three-dimensional contingency table. 


Example 16.6 A researcher conducts a study on altruistic behavior. All subjects who 
participate in the experiment are males. Each of the 160 male subjects is given a one-hour test 
which is ostensibly a measure of intelligence. During the test 65 of the subjects are exposed to 
continual loud noise, which they are told is due to a malfunctioning generator. The other 95 
subjects are not exposed to any noise during the test. Upon completion of this stage of the 
experiment, each subject on leaving the room is confronted by a middle-aged man or woman whose 
arm is in a sling. The latter individual asks the subject if he would be willing to help him/her 
carry a heavy package to his/her car. Eighty of the subjects are confronted by a male who asks for 
help, while the other 80 are confronted by a female. In actuality, the person requesting help is 
an experimental confederate (i.e., working for the experimenter). The dependent variable in the 
experiment is whether or not a subject helps the person carry the package. Table 16.27 
summarizes the data for the experiment. Do the data indicate that altruistic behavior is influenced 
by noise and/or the gender of the person requesting help? 


Example 16.7 A researcher wants to determine if there is a relationship between the following 
three variables/dimensions: a) A person's political affiliation — specifically, whether a person 
is a Democrat or a Republican; b) A person's categorization on the personality dimension of 
introversion—extroversion; and c) A person's gender (male versus female). One hundred and 
sixty people are recruited to participate in the study. All of the subjects are given a personality 
test, on the basis of which each subject is classified as an introvert or an extrovert. Each subject 
is then asked to indicate whether he or she is a Democrat or a Republican. The data for 
Example 16.7, which can be summarized in the form of a three-dimensional contingency table, 
are presented in Table 16.27. Do the data indicate that the three variables are independent of 
one another? 


Table 16.27 is a three-dimensional contingency table that simultaneously summarizes the 
data for Examples 16.6 and 16.7. In a three-dimensional table, one of the variables is designated 
as the row variable, a second variable is designated as the column variable, and the third variable 
is designated as the layer variable. (Zar (1999) employs the term tier variable to designate the 
third variable.) In a three-dimensional table there will be a total of r x c x l cells, where r 
represents the number of row categories, c the number of column categories, and l the number 
of layer categories. 

As noted above, the experiment described in Example 16.6 has two independent variables. 
One of the independent variables is the noise manipulation, which is comprised of the two 
levels noise versus no noise. In Table 16.27 the noise manipulation independent variable is 
designated as the row variable. The second independent variable is whether a subject is 
confronted by a male or a female confederate. In Table 16.27 the independent variable 
represented by the gender of the confederate is designated as the column variable. The 
dependent variable is whether or not a subject helped the confederate. In Table 16.27 whether 
or not a subject helped, which has the two levels helped and did not help, is designated as the 
layer variable. 

In the case of Example 16.7, where no clear-cut distinction is made with respect to a 
variable being an independent or dependent variable, the row variable will be the categorization 
of a person on the introversion-extroversion dimension, the column variable will be the 
gender of the subject (male-female), and the layer variable will be a person's political 
affiliation (Democrat-Republican). 


Table 16.27  Summary Data for Examples 16.6/16.7 

                          Helped/Democrat      Did not help/Republican 
                          Male      Female       Male      Female       Totals 
Noise/Introvert            10         15          25         15           65 
No Noise/Extrovert         25         45          20          5           95 
Totals                     35         60          45         20          160 

Sums:   Row 1 (Noise/Introvert)               = O_1.. = R_1 = 65 
        Row 2 (No Noise/Extrovert)            = O_2.. = R_2 = 95 
        Column 1 (Male)                       = O_.1. = C_1 = 80 
        Column 2 (Female)                     = O_.2. = C_2 = 80 
        Layer 1 (Helped/Democrat)             = O_..1 = L_1 = 95 
        Layer 2 (Did not help/Republican)     = O_..2 = L_2 = 65 

Christensen (1990, pp. 63-64) notes that among the possible ways of conceptualizing the 
relationship between the variables in a three-way contingency table are the following: a) Rows, 
columns, and layers are all independent of one another; b) Rows are independent of columns and 
layers (with columns and layers not necessarily being independent of one another); c) Columns 
are independent of rows and layers (with rows and layers not necessarily being independent of 
one another); d) Layers are independent of rows and columns (with rows and columns not 
necessarily being independent of one another); e) Given any specific level of a row, columns and 
layers are independent of one another; f) Given any specific level of a column, rows and layers 
are independent of one another; g) Given any specific level of a layer, rows and columns are 
independent of one another. 

Each of the aforementioned ways of conceptualizing a contingency table is often referred 
to as a model. Of the seven models noted above, Model a is referred to as the model of 
complete (or mutual) independence, Models b, c, and d as models of partial independence, 
and Models e, f, and g as models of conditional independence. The discussion to follow will 
be limited to a description of the analysis of the models of complete and partial independence. 


Test of model of complete independence The initial analysis that will be conducted on Table 
16.27 will evaluate whether in each of the studies all three variables are independent of one 
another. This analysis (which, as noted earlier, assesses complete independence), will be evalu- 
ated with what will be referred to as the omnibus chi-square analysis. Within the framework 
of the latter analysis, the following null and alternative hypotheses will be evaluated. 

Null hypothesis H_0: In Example 16.6, in the underlying population the three variables 
(exposure to noise, gender of confederate, and helping behavior) are all independent of one another. 
In Example 16.7, in the underlying population the three variables (introversion-extroversion, 
gender, and political affiliation) are all independent of one another. The notation H_0: O_ijk = E_ijk 
for all cells can also be employed, which means that in the underlying population(s) the 
sample(s) represent(s), for each of the r x c x l cells the observed frequency of a cell is equal to 
the expected frequency of the cell. With respect to the sample data, this translates into the 
observed frequency of each of the r x c x l cells being equal to the expected frequency of the 
cell. 


Alternative hypothesis H_1: In Example 16.6, in the underlying population the three variables 
(exposure to noise, gender of confederate, and helping behavior) are not all independent of one 
another. In Example 16.7, in the underlying population the three variables 
(introversion-extroversion, gender, and political affiliation) are not all independent of one another. The 
notation H_1: O_ijk ≠ E_ijk for at least one cell can also be employed, which means that in the 
underlying population(s) the sample(s) represent(s), for at least one of the r x c x l cells the 
observed frequency of a cell is not equal to the expected frequency of the cell. With respect to 
the sample data, this translates into the observed frequency of at least one of the r x c x l cells 
not being equal to the expected frequency of the cell. It will be assumed that the alternative 
hypothesis is nondirectional (although it is possible to have a directional alternative hypothesis). 


Equation 16.26 (which is a generalization of Equation 16.2 to a three-dimensional table) 
is employed to compute the test statistic for a three-dimensional table. 


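\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{l} \frac{(O_{ijk} - E_{ijk})^2}{E_{ijk}} \qquad \text{(Equation 16.26)} \]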


© 2000 by Chapman & Hall/CRC 


Note that the notation k in the ijk subscript for the observed and expected frequencies in 
Equation 16.26 (as well as in the null and alternative hypotheses) represents the kth layer of the 
layer variable. Thus, in Equation 16.26 the notation E_ijk means the expected frequency for the 
cell in Row i, Column j, and Layer k. Note that just as there are r levels on the row variable and 
c levels on the column variable, there are l levels on the layer variable. 

The operations described by Equation 16.26 (which are the same as those described for 
computing the chi-square statistic for the chi-square goodness-of-fit test and the chi-square test 
for r x c tables) are as follows: a) The expected frequency of each cell is subtracted from its 
observed frequency; b) For each cell, the difference between the observed and expected 
frequency is squared; c) For each cell, the squared difference between the observed and expected 
frequency is divided by the expected frequency of the cell; and d) The value of chi-square is 
computed by summing all of the values computed in part c). 
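
The four operations can be sketched in a few lines of Python. This is only an illustration, not part of the original analysis: the function name `chi_square` and the dictionary layout keyed by (row, column, layer) are assumptions made for the example, and the observed and rounded expected frequencies are the ones that appear in Table 16.28.

```python
def chi_square(observed, expected):
    """Sum (O - E)^2 / E over every cell, following operations a) through d)."""
    total = 0.0
    for cell, o in observed.items():
        e = expected[cell]
        diff = o - e            # a) expected subtracted from observed
        squared = diff ** 2     # b) difference squared
        total += squared / e    # c) divided by expected, d) summed over cells
    return total

# Cells keyed by (row i, column j, layer k); values as listed in Table 16.28.
observed = {(1, 1, 1): 10, (1, 2, 1): 15, (1, 1, 2): 25, (1, 2, 2): 15,
            (2, 1, 1): 25, (2, 2, 1): 45, (2, 1, 2): 20, (2, 2, 2): 5}
expected = {(1, 1, 1): 19.30, (1, 2, 1): 19.30, (1, 1, 2): 13.20, (1, 2, 2): 13.20,
            (2, 1, 1): 28.20, (2, 2, 1): 28.20, (2, 1, 2): 19.30, (2, 2, 2): 19.30}

chi2 = chi_square(observed, expected)
print(round(chi2, 2))
```

Because the function loops over whatever cells it is given, the same sketch applies to a table with any number of dimensions; only the cell keys change. The result agrees with the tabled value of 37.24 up to rounding of the individual terms.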

Note that in contrast to Equation 16.26, there is only one summation sign in the analogous 
equation for the chi-square goodness-of-fit test (Equation 8.2), since the latter test has only a 
single variable (designated as the row variable). In the same respect there are two summation 
signs for the analogous equation for the chi-square test for r x c tables (Equation 16.2), since 
the latter test has two variables — a row variable and a column variable. Since there are three 
variables, the three-dimensional equation requires summing over all three dimensions. (If there 
are four dimensions the chi-square equation will have four summation signs, since summing will 
have to be done over all four dimensions. Five dimensions will require five summation signs, 
and so on.) 

The protocol for conducting the chi-square analysis is identical to that employed for a two- 
dimensional table. The only aspect of the analysis that is different is the computation of the 
expected frequency of a cell, which in the case of a three-dimensional table is computed with 
Equation 16.27. Note that Equation 16.27 is written two ways. The first representation of the 
equation is consistent with the format used in Equation 16.1, the equation for computing the 
expected frequency of a two-dimensional table. Thus, $O_{i..}$ represents the number of
observations in the ith row, which can be represented more simply as $R_i$. $O_{.j.}$ represents the
number of observations in the jth column, which can be represented more simply as $C_j$. $O_{..k}$
represents the number of observations in the kth layer, which can be represented more simply
as $L_k$. The totals for the rows, columns, and layers are recorded at the bottom of Table 16.27.


$$E_{ijk} = \frac{(O_{i..})(O_{.j.})(O_{..k})}{n^2} = \frac{(R_i)(C_j)(L_k)}{n^2}$$ (Equation 16.27)


The notation in Equation 16.27 indicates that to compute the expected frequency of a cell,
the observed frequency of the row the cell is in is multiplied by the observed frequency of the
column the cell is in, and the resulting value is multiplied by the observed frequency of the layer
the cell is in. The resulting product is divided by the square of the total number of observations
in the contingency table.

To illustrate, we will compute the expected frequency for the cell in the upper left of Table
16.27. The latter cell, which is Cell$_{111}$ (i.e., the cell in Row 1, Column 1, and Layer 1), represents
subjects in the noise/introvert category, the male category, and the helped/Democrat category.
We thus multiply the total number of observations in Row 1 (which is 65) by the total number of
observations in Column 1 (which is 80) by the total number of observations in Layer 1 (which is
95). The product is divided by the square of the total number of observations in the contingency
table (the total number of observations being 160). Thus: $E_{111} = [(65)(80)(95)]/(160)^2 = 19.30$.
The expected frequencies of the remaining seven cells are computed below.

$E_{121} = [(65)(80)(95)]/(160)^2 = 19.30$
$E_{112} = [(65)(80)(65)]/(160)^2 = 13.20$
$E_{122} = [(65)(80)(65)]/(160)^2 = 13.20$
$E_{211} = [(95)(80)(95)]/(160)^2 = 28.20$
$E_{221} = [(95)(80)(95)]/(160)^2 = 28.20$
$E_{212} = [(95)(80)(65)]/(160)^2 = 19.30$
$E_{222} = [(95)(80)(65)]/(160)^2 = 19.30$
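
The eight expected frequencies can also be generated directly from the marginal totals. The following is a minimal sketch of Equation 16.27; the dictionary names are assumptions introduced for the illustration, while the totals (65 and 95 for rows, 80 and 80 for columns, 95 and 65 for layers, n = 160) are those used in the computations above.

```python
n = 160
row_totals = {1: 65, 2: 95}      # R_i: noise/introvert, no noise/extrovert
col_totals = {1: 80, 2: 80}      # C_j: male, female
layer_totals = {1: 95, 2: 65}    # L_k: helped/Democrat, did not help/Republican

# Equation 16.27: E_ijk = (R_i)(C_j)(L_k) / n^2
expected = {
    (i, j, k): row_totals[i] * col_totals[j] * layer_totals[k] / n ** 2
    for i in row_totals
    for j in col_totals
    for k in layer_totals
}

for cell in sorted(expected):
    print(cell, round(expected[cell], 2))
```

A useful check on the arithmetic is that the expected frequencies, like the observed frequencies, sum to n.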


The chi-square analysis from this point on is identical to that employed for a two- 
dimensional table. The analysis is summarized in Table 16.28. The cell identification codes in 
the table are N/I = Noise/Introvert; NN/E = No noise/Extrovert; M = Male; F = Female; H/D 
= Helped/Democrat; DNH/R = Did not help/Republican. 

The degrees of freedom employed in the omnibus analysis of a three-dimensional
contingency table are df = rcl - r - c - l + 2. Thus, for our example, since r = c = l = 2,
df = (2)(2)(2) - 2 - 2 - 2 + 2 = 4. Employing Table A4, the tabled critical .05 and .01 chi-square
values for df = 4 are $\chi^2_{.05} = 9.49$ and $\chi^2_{.01} = 13.28$. Since the computed value
$\chi^2 = 37.24$ is greater than both of the aforementioned critical values, the null hypothesis can be
rejected at both the .05 and .01 levels. By virtue of rejecting the null hypothesis, the researcher
can conclude that in Examples 16.6/16.7 the three variables are not independent of one another.
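
The degrees-of-freedom formula and the comparison with the tabled critical values can be written out as a short check. The sketch below is illustrative only: the variable names are assumptions, and the critical values are simply those quoted above from Table A4 rather than values computed by the code.

```python
r, c, l = 2, 2, 2                      # levels of the row, column, and layer variables
df = r * c * l - r - c - l + 2         # omnibus (complete independence) model

chi2 = 37.24                           # computed value from Table 16.28
critical = {0.05: 9.49, 0.01: 13.28}   # tabled chi-square values for df = 4

# True means the null hypothesis is rejected at that level.
decisions = {alpha: chi2 > value for alpha, value in critical.items()}
print(df, decisions)
```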


Table 16.28 Chi-Square Summary Table for Omnibus Analysis of Examples 16.6/16.7

Cell                    O_ijk    E_ijk   (O_ijk - E_ijk)   (O_ijk - E_ijk)^2   (O_ijk - E_ijk)^2 / E_ijk
111 (N/I, M, H/D)          10    19.30              -9.3               86.49                        4.48
121 (N/I, F, H/D)          15    19.30              -4.3               18.49                         .96
112 (N/I, M, DNH/R)        25    13.20              11.8              139.24                       10.55
122 (N/I, F, DNH/R)        15    13.20               1.8                3.24                         .25
211 (NN/E, M, H/D)         25    28.20              -3.2               10.24                         .36
221 (NN/E, F, H/D)         45    28.20              16.8              282.24                       10.01
212 (NN/E, M, DNH/R)       20    19.30                .7                 .49                         .03
222 (NN/E, F, DNH/R)        5    19.30             -14.3              204.49                       10.60
Sums                      160      160                 0                                $\chi^2$ = 37.24


Test of models of partial independence If the evaluation of the model of complete inde- 
pendence through use of the omnibus chi-square analysis yields a significant result, a researcher 
should conduct additional analyses in order to further clarify the nature of the relationship 
between the three variables. As noted earlier, among the analyses that can be conducted are those 
for partial independence, which determine whether one variable is independent of the other two 
variables. Thus, we can determine the following: a) Whether rows are independent of columns 
and layers; b) Whether columns are independent of rows and layers; and c) Whether layers are 
independent of rows and columns. The latter three analyses will now be conducted. 


Test of independence of rows versus columns and layers Equation 16.26 is employed
to determine whether the row variable is independent of the column and layer variables. The only 
difference in the analysis to be described in this section from the omnibus analysis conducted in 
the previous section is the computation of the expected frequency for each cell. Equation 16.28 
is employed to compute the expected frequency of a cell. 

$$E_{ijk} = \frac{(O_{i..})(O_{.jk})}{n} = \frac{(R_i)(CL_{jk})}{n}$$ (Equation 16.28)

The notation in Equation 16.28 indicates that to compute the expected frequency of a cell, 
the number of observations for the row in which that cell appears is multiplied by the number of 
observations that are in both the column and layer designated for that cell. The resulting product 
is divided by the total number of observations in the contingency table. 

To illustrate, we will compute the expected frequency for the cell in the upper left of Table
16.27. The latter cell, which is Cell$_{111}$ (i.e., the cell in Row 1, Column 1, and Layer 1), represents
subjects in the noise/introvert category, the male category, and the helped/Democrat category.
We thus multiply the total number of observations in Row 1 (noise/introvert) (which is 65) by
the total number of observations in both Column 1 and Layer 1 (i.e., observations in the male
column that are in the helped/Democrat layer) (which is 35). The product is divided by the total
number of observations in the contingency table (which is 160). Thus: $E_{111} = [(65)(35)]/(160)
= 14.22$. The expected frequencies of the remaining seven cells are computed below.


$E_{121} = [(65)(60)]/(160) = 24.38$
$E_{112} = [(65)(45)]/(160) = 18.28$
$E_{122} = [(65)(20)]/(160) = 8.13$
$E_{211} = [(95)(35)]/(160) = 20.78$
$E_{221} = [(95)(60)]/(160) = 35.63$
$E_{212} = [(95)(45)]/(160) = 26.72$
$E_{222} = [(95)(20)]/(160) = 11.88$


The chi-square analysis to determine whether the row variable is independent of the column 
and layer variables is summarized in Table 16.29. 


Table 16.29 Chi-Square Summary Table for Rows versus Columns/Layers
Analysis of Examples 16.6/16.7

Cell                    O_ijk    E_ijk   (O_ijk - E_ijk)   (O_ijk - E_ijk)^2   (O_ijk - E_ijk)^2 / E_ijk
111 (N/I, M, H/D)          10    14.22             -4.22               17.81                        1.25
121 (N/I, F, H/D)          15    24.38             -9.38               87.98                        3.61
112 (N/I, M, DNH/R)        25    18.28              6.72               45.16                        2.47
122 (N/I, F, DNH/R)        15     8.13              6.87               47.20                        5.81
211 (NN/E, M, H/D)         25    20.78              4.22               17.81                         .86
221 (NN/E, F, H/D)         45    35.63              9.37               87.80                        2.46
212 (NN/E, M, DNH/R)       20    26.72             -6.72               45.16                        1.69
222 (NN/E, F, DNH/R)        5    11.88             -6.88               47.33                        3.98
Sums                      160      160                 0                                $\chi^2$ = 22.13


The degrees of freedom employed in the analysis are df = rcl - cl - r + 1 = (2)(2)(2) -
(2)(2) - 2 + 1 = 3. Employing Table A4, the tabled critical .05 and .01 chi-square values for
df = 3 are $\chi^2_{.05} = 7.81$ and $\chi^2_{.01} = 11.34$. Since the computed value $\chi^2 = 22.13$ is greater than
both of the aforementioned critical values, the null hypothesis can be rejected at both the .05 and
.01 levels. By virtue of rejecting the null hypothesis, the researcher can conclude that the row
variable (noise manipulation/introversion-extroversion) is not independent of the column
(gender) and layer (helping/political affiliation) variables.
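
The rows versus columns/layers analysis can be reproduced from the eight observed frequencies alone. The sketch below is a hedged illustration of Equations 16.28 and 16.26; the names `row_totals` and `col_layer_totals` are assumptions chosen for readability, and the expected frequencies are left unrounded, so the statistic may differ from the tabled 22.13 in the second decimal place.

```python
# Observed frequencies keyed by (row i, column j, layer k), as in Table 16.29.
observed = {
    (1, 1, 1): 10, (1, 2, 1): 15, (1, 1, 2): 25, (1, 2, 2): 15,
    (2, 1, 1): 25, (2, 2, 1): 45, (2, 1, 2): 20, (2, 2, 2): 5,
}
n = sum(observed.values())

# Marginal totals: R_i for each row, CL_jk for each column-by-layer combination.
row_totals = {}
col_layer_totals = {}
for (i, j, k), o in observed.items():
    row_totals[i] = row_totals.get(i, 0) + o
    col_layer_totals[(j, k)] = col_layer_totals.get((j, k), 0) + o

# Equation 16.28: E_ijk = (R_i)(CL_jk) / n
expected = {
    (i, j, k): row_totals[i] * col_layer_totals[(j, k)] / n
    for (i, j, k) in observed
}

# Equation 16.26 evaluated with these expected frequencies.
chi2 = sum((o - expected[cell]) ** 2 / expected[cell]
           for cell, o in observed.items())

r, c, l = 2, 2, 2
df = r * c * l - c * l - r + 1

print(round(chi2, 2), df)
```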

Test of independence of columns versus rows and layers Equation 16.26 is employed
to determine whether the column variable is independent of the row and layer variables. The 
only difference in the analysis to be described in this section and the previous analyses described 
is the computation of the expected frequency for each cell. The data have been reorganized in 
Table 16.30 to facilitate the computation of the expected frequencies for the analysis of 
independence of columns versus rows and layers. 


Table 16.30 Summary of Data for Examples 16.6/16.7 for Columns
versus Rows/Layers Analysis

                     Helped/Democrat                   Did not help/Republican
          Noise/Introvert   No Noise/Extrovert   Noise/Introvert   No Noise/Extrovert   Totals
Male             10                 25                  25                 20              80
Female           15                 45                  15                  5              80
Totals           25                 70                  40                 25             160


Equation 16.29 is employed to compute the expected frequency of a cell. 


$$E_{ijk} = \frac{(O_{.j.})(O_{i.k})}{n} = \frac{(C_j)(RL_{ik})}{n}$$ (Equation 16.29)


The notation in Equation 16.29 indicates that to compute the expected frequency of a cell, 
the number of observations for the column in which that cell appears is multiplied by the number 
of observations that are in both the row and layer designated for that cell. The resulting product 
is divided by the total number of observations in the contingency table. 

To illustrate, we will compute the expected frequency for the cell in the upper left of Table
16.30. The latter cell, which is Cell$_{111}$ (i.e., the cell in Row 1, Column 1, and Layer 1), represents
subjects in the noise/introvert category, the male category, and the helped/Democrat category.
We thus multiply the total number of observations in Column 1 (males) (which is 80) by the total
number of observations in both Row 1 and Layer 1 (i.e., observations in the noise/introvert
category that are in the helped/Democrat layer) (which is 25). The product is divided by the total
number of observations in the contingency table (which is 160). Thus: $E_{111} = [(80)(25)]/(160)
= 12.5$. The expected frequencies of the remaining seven cells are computed below.


$E_{121} = [(80)(25)]/(160) = 12.5$
$E_{112} = [(80)(40)]/(160) = 20$
$E_{122} = [(80)(40)]/(160) = 20$
$E_{211} = [(80)(70)]/(160) = 35$
$E_{221} = [(80)(70)]/(160) = 35$
$E_{212} = [(80)(25)]/(160) = 12.5$
$E_{222} = [(80)(25)]/(160) = 12.5$

The chi-square analysis to determine whether the column variable is independent of the row 
and layer variables is summarized in Table 16.31. 

Table 16.31 Chi-Square Summary Table for Columns versus Rows/Layers
Analysis of Examples 16.6/16.7

Cell                    O_ijk    E_ijk   (O_ijk - E_ijk)   (O_ijk - E_ijk)^2   (O_ijk - E_ijk)^2 / E_ijk
111 (N/I, M, H/D)          10    12.50              -2.5                6.25                         .50
121 (N/I, F, H/D)          15    12.50               2.5                6.25                         .50
112 (N/I, M, DNH/R)        25    20.00               5.0               25.00                        1.25
122 (N/I, F, DNH/R)        15    20.00              -5.0               25.00                        1.25
211 (NN/E, M, H/D)         25    35.00             -10.0              100.00                        2.86
221 (NN/E, F, H/D)         45    35.00              10.0              100.00                        2.86
212 (NN/E, M, DNH/R)       20    12.50               7.5               56.25                        4.50
222 (NN/E, F, DNH/R)        5    12.50              -7.5               56.25                        4.50
Sums                      160      160                 0                                $\chi^2$ = 18.22

The degrees of freedom employed in the analysis are df = rcl - rl - c + 1 = (2)(2)(2) -
(2)(2) - 2 + 1 = 3. Employing Table A4, the tabled critical .05 and .01 chi-square values for
df = 3 are $\chi^2_{.05} = 7.81$ and $\chi^2_{.01} = 11.34$. Since the computed value $\chi^2 = 18.22$ is greater than
both of the aforementioned critical values, the null hypothesis can be rejected at both the .05 and
.01 levels. By virtue of rejecting the null hypothesis, the researcher can conclude that the column
variable (gender) is not independent of the row (noise manipulation/introversion-extroversion)
and layer (helping/political affiliation) variables.
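
Because the data have been laid out with gender as the two rows and the four row-by-layer combinations as columns, the analysis can be sketched as an ordinary two-way computation on that 2 x 4 arrangement. The code below is an illustration under that reading; the list names are assumptions, and the degrees of freedom are computed as (rows - 1)(columns - 1) of the rearranged table, which is algebraically the same as rcl - rl - c + 1.

```python
# Rearranged layout: one row per gender, one column per row-by-layer combination
# (noise/helped, no noise/helped, noise/did not help, no noise/did not help).
table = [
    [10, 25, 25, 20],   # male
    [15, 45, 15, 5],    # female
]
gender_totals = [sum(row) for row in table]         # C_j
combo_totals = [sum(col) for col in zip(*table)]    # RL_ik
n = sum(gender_totals)

# Equation 16.29: E_ijk = (C_j)(RL_ik) / n
expected = [[c_j * rl / n for rl in combo_totals] for c_j in gender_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(table, expected)
           for o, e in zip(obs_row, exp_row))
df = (len(table) - 1) * (len(combo_totals) - 1)

print(round(chi2, 2), df)
```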

Test of independence of layers versus rows and columns Equation 16.26 is employed 
to determine whether the layer variable is independent of the row and column variables. The 
only difference in the analysis to be described in this section and the previous analyses described 
is the computation of the expected frequency for each cell. The data have been reorganized in 
Table 16.32 to facilitate the computation of the expected frequencies for the analysis of 
independence of layers versus rows and columns. 


Table 16.32 Summary of Data for Examples 16.6/16.7 for Layers
versus Rows/Columns Analysis

                             Noise/Introvert      No Noise/Extrovert
                             Male     Female      Male     Female      Totals
Helped/Democrat               10        15         25        45          95
Did not help/Republican       25        15         20         5          65
Totals                        35        30         45        50         160


Equation 16.30 is employed to compute the expected frequency of a cell. 


$$E_{ijk} = \frac{(O_{..k})(O_{ij.})}{n} = \frac{(L_k)(RC_{ij})}{n}$$ (Equation 16.30)

The notation in Equation 16.30 indicates that to compute the expected frequency of a cell, 
the number of observations for the layer in which that cell appears is multiplied by the number 
of observations that are in both the row and column designated for that cell. The resulting 
product is divided by the total number of observations in the contingency table. 

To illustrate, we will compute the expected frequency for the cell in the upper left of Table
16.32. The latter cell, which is Cell$_{111}$ (i.e., the cell in Row 1, Column 1, and Layer 1), represents
subjects in the noise/introvert category, the male category, and the helped/Democrat category.
We thus multiply the total number of observations in Layer 1 (helped/Democrat) (which is 95)
by the total number of observations in both Row 1 and Column 1 (i.e., observations in the
noise/introvert row that are in the male column) (which is 35). The product is divided by the total
number of observations in the contingency table (which is 160). Thus: $E_{111} = [(95)(35)]/(160)
= 20.78$. The expected frequencies of the remaining seven cells are computed below.

$E_{121} = [(95)(30)]/(160) = 17.81$
$E_{112} = [(65)(35)]/(160) = 14.22$
$E_{122} = [(65)(30)]/(160) = 12.19$
$E_{211} = [(95)(45)]/(160) = 26.72$
$E_{221} = [(95)(50)]/(160) = 29.69$
$E_{212} = [(65)(45)]/(160) = 18.28$
$E_{222} = [(65)(50)]/(160) = 20.31$


The chi-square analysis to determine whether the layer variable is independent of the row 
and column variables is summarized in Table 16.33. 


Table 16.33 Chi-Square Summary Table for Layers versus Rows/Columns
Analysis of Examples 16.6/16.7

Cell                    O_ijk    E_ijk   (O_ijk - E_ijk)   (O_ijk - E_ijk)^2   (O_ijk - E_ijk)^2 / E_ijk
111 (N/I, M, H/D)          10    20.78            -10.78              116.21                        5.59
121 (N/I, F, H/D)          15    17.81             -2.81                7.90                         .44
112 (N/I, M, DNH/R)        25    14.22             10.78              116.21                        8.17
122 (N/I, F, DNH/R)        15    12.19              2.81                7.90                         .65
211 (NN/E, M, H/D)         25    26.72             -1.72                2.96                         .11
221 (NN/E, F, H/D)         45    29.69             15.31              234.40                        7.89
212 (NN/E, M, DNH/R)       20    18.28              1.72                2.96                         .16
222 (NN/E, F, DNH/R)        5    20.31            -15.31              234.40                       11.54
Sums                      160      160                 0                                $\chi^2$ = 34.55


The degrees of freedom employed in the analysis are df = rcl - rc - l + 1 = (2)(2)(2) -
(2)(2) - 2 + 1 = 3. Employing Table A4, the tabled critical .05 and .01 chi-square values for
df = 3 are $\chi^2_{.05} = 7.81$ and $\chi^2_{.01} = 11.34$. Since the computed value $\chi^2 = 34.55$ is greater than
both of the aforementioned critical values, the null hypothesis can be rejected at both the .05 and
.01 levels. By virtue of rejecting the null hypothesis, the researcher can conclude that the layer
variable (helping/political affiliation) is not independent of the row (noise manipulation/
introversion-extroversion) and column (gender) variables.
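
The three partial independence analyses share one pattern: the total for the isolated variable is multiplied by the total for the combination of the other two variables and divided by n. A single helper can therefore cover all three. The function `partial_independence` and its `axis` argument are hypothetical names introduced for this sketch, and the expected frequencies are not rounded, so the statistics may differ from the tabled values in the second decimal place.

```python
def partial_independence(observed, axis):
    """Chi-square and df for one variable (axis 0, 1, or 2) versus the other two jointly."""
    n = sum(observed.values())
    single = {}   # totals for the isolated variable
    joint = {}    # totals for each combination of the remaining two variables
    for cell, o in observed.items():
        rest = cell[:axis] + cell[axis + 1:]
        single[cell[axis]] = single.get(cell[axis], 0) + o
        joint[rest] = joint.get(rest, 0) + o

    chi2 = 0.0
    for cell, o in observed.items():
        rest = cell[:axis] + cell[axis + 1:]
        e = single[cell[axis]] * joint[rest] / n
        chi2 += (o - e) ** 2 / e

    # e.g. rcl - rc - l + 1 when the layer variable is isolated
    df = len(observed) - len(joint) - len(single) + 1
    return chi2, df

# Observed frequencies keyed by (row i, column j, layer k).
observed = {
    (1, 1, 1): 10, (1, 2, 1): 15, (1, 1, 2): 25, (1, 2, 2): 15,
    (2, 1, 1): 25, (2, 2, 1): 45, (2, 1, 2): 20, (2, 2, 2): 5,
}

results = {name: partial_independence(observed, axis)
           for axis, name in enumerate(("rows", "columns", "layers"))}
for name, (chi2, df) in results.items():
    print(name, round(chi2, 2), df)
```

The df line assumes every cell of the table appears in `observed`, so that the number of keys equals rcl.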

To clarify the nature of the relationship between the variables with greater precision, it will 
be necessary to conduct additional analyses on the data. To illustrate this, we will just examine 
the data in reference to Example 16.6 in greater detail. Recollect that in the latter study there are 
two independent variables, the noise manipulation and gender, and a dependent variable, which 
is the helping behavior of subjects. Let us assume that prior to the study the experimenter 
predicted the following: a) Subjects exposed to the no noise condition will be more likely to help 
the confederate than subjects exposed to the noise condition; and b) Subjects exposed to a 
female confederate will be more likely to help than subjects exposed to a male confederate. The 
two aforementioned hypotheses are predicting what is referred to as a main effect on both of the 
independent variables. The term main effect (which is discussed in greater detail in Sections I and 
V of the between-subjects factorial analysis of variance) describes the effect of one independent 
variable (also referred to as a factor) on the dependent variable, ignoring any effect any of the 
other independent variables/factors might have on the dependent variable. If the researcher 
considers each independent variable separately, two 2 x 2 contingency tables can be constructed 
to summarize the data. The latter is done with Tables 16.34 and 16.35, which, respectively,
summarize the data when the noise manipulation independent variable is considered by itself
and when the gender independent variable is considered by itself.


Table 16.34 Summary of Data for Example 16.6 Employing
Only Noise Manipulation Independent Variable

               Helped              Did not help
               the confederate     the confederate     Totals
Noise                25                  40               65
No Noise             70                  25               95
Totals               95                  65              160


Table 16.35 Summary of Data for Example 16.6 Employing
Only Gender Independent Variable

               Helped              Did not help
               the confederate     the confederate     Totals
Male                 35                  45               80
Female               60                  20               80
Totals               95                  65              160

With regard to the predicted main effects, consider the following information that can be 
derived from Tables 16.34 and 16.35. 

Without employing a test of significance on the data (which in this case would be the chi- 
square test for homogeneity/z test for two independent proportions), it appears that a subject 
is more likely to help in the no noise condition than the noise condition. This is the case, since 
the proportion of subjects who helped the confederate in the no noise condition is 70/95 = .74, 
while the proportion of subjects who helped the confederate in the noise condition is only 25/65 
= .38. This clearly suggests the presence of a main effect on the noise manipulation indepen- 
dent variable. In other words, if in analyzing the data the researcher does not bother to consider 
gender as a second independent variable, but considers the noise manipulation as the only 
independent variable, the researcher will conclude that subjects are more likely to help the 
confederate in the no noise condition rather than in the noise condition. 

Without employing a test of significance on the data (which in this case would be the chi- 
square test for homogeneity/z test for two independent proportions), it appears that a subject 
is more likely to help a female confederate than a male confederate. This is the case, since the 
proportion of subjects who helped the female confederate is 60/80 = .75, while the proportion of 
subjects who helped the male confederate is only 35/80 = .44. This clearly suggests the presence 
of a main effect on the gender independent variable. In other words, if in analyzing the data the 
researcher does not bother to consider the noise manipulation as a second independent variable, 
but considers gender as the only independent variable, the researcher will conclude that subjects 
are more likely to help the female confederate rather than the male confederate. 
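
The marginal proportions quoted above can be recovered by collapsing the eight cells over one independent variable at a time. The sketch below is illustrative; the `cells` dictionary and the helper `proportion_helped` are assumptions made for the example, with the frequencies taken from the data for Example 16.6.

```python
# (noise condition, confederate gender) -> (helped, did not help)
cells = {
    ("noise", "male"): (10, 25),
    ("noise", "female"): (15, 15),
    ("no noise", "male"): (25, 20),
    ("no noise", "female"): (45, 5),
}

def proportion_helped(position, level):
    """Proportion who helped at one level of a variable, ignoring the other variable."""
    helped = sum(h for key, (h, d) in cells.items() if key[position] == level)
    total = sum(h + d for key, (h, d) in cells.items() if key[position] == level)
    return helped / total

noise_effect = {lvl: proportion_helped(0, lvl) for lvl in ("noise", "no noise")}
gender_effect = {lvl: proportion_helped(1, lvl) for lvl in ("male", "female")}

print(noise_effect)
print(gender_effect)
```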

It was noted earlier in the discussion of Simpson's paradox that when data based on three 
variables are collapsed into two 2 x 2 contingency tables, a distorted picture of what actually 
occurred in a study may result. In the case of Example 16.6, there appears to be a definite inter- 
action between the independent variables of noise and gender. Although for Example 16.6 the 
conclusions that will be reached if one employs two 2 x 2 contingency tables (i.e., Tables 16.34 
and 16.35) are not as skewed as in the example employed to demonstrate Simpson's paradox, 
the use of two 2 x 2 tables still does not present an entirely accurate picture of what occurred in 
the study. To be more specific, the 2 x 2 tables are unable to reveal the interaction between the 
two independent variables. Specifically, consider the following, all of which are summarized in 
Table 16.36: a) The proportion of subjects who helped who were exposed to noise and a male
confederate is 10/35 = .29; b) The proportion of subjects who helped who were exposed to noise 
and a female confederate is 15/30 = .50; c) The proportion of subjects who helped who were 
exposed to no noise and a male confederate is 25/45 = .56; and d) The proportion of subjects 
who helped who were exposed to no noise and a female confederate is 45/50 = .90. Note that 
in Table 16.36, the proportion .44 for males in Column 1 is the proportion of all subjects exposed 
to a male confederate who helped (35/80 = .44). The proportion .75 for females in Column 2 
is the proportion of all subjects exposed to a female confederate who helped (60/80 = .75). The 
proportion .38 for noise in Row 1 is the proportion of all subjects exposed to noise who helped 
(25/65 = .38). The proportion .74 for no noise in Row 2 is the proportion of all subjects exposed 
to no noise who helped (70/95 = .74). 


Table 16.36 Summary of Interaction for Example 16.6:
Proportions of Helping Across Both Independent Variables

                       Male     Female     Row proportions
Noise                   .29       .50            .38
No Noise                .56       .90            .74
Column proportions      .44       .75


As noted earlier, an interaction is present in a set of data when the performance of subjects 
on one independent variable is not consistent across all the levels of another independent 
variable. Examination of Table 16.36 clearly suggests the presence of an interaction. 
Specifically, the following appears to be the case: Subjects are more likely to help a female 
confederate than a male confederate, but the proportion of females helped relative to the 
proportion of males helped is larger in the no noise condition than in the noise condition. We 
can also say that subjects are more likely to help the confederate in the no noise condition than 
the noise condition, but the proportion of subjects who help in the no noise condition relative to 
the noise condition is larger when the confederate is a female as opposed to a male. 
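
The pattern described above can be checked numerically from the cell proportions. In the sketch below, "larger relative to" is read as a larger difference between two proportions; that reading is an assumption made for the illustration, as are the dictionary names.

```python
# (noise condition, confederate gender) -> (helped, did not help)
cells = {
    ("noise", "male"): (10, 25),
    ("noise", "female"): (15, 15),
    ("no noise", "male"): (25, 20),
    ("no noise", "female"): (45, 5),
}

# Proportion who helped within each combination of the two independent variables.
prop = {key: h / (h + d) for key, (h, d) in cells.items()}

# Female-minus-male difference within each noise condition.
gender_gap = {noise: prop[(noise, "female")] - prop[(noise, "male")]
              for noise in ("noise", "no noise")}

# No-noise-minus-noise difference for each confederate gender.
noise_gap = {gender: prop[("no noise", gender)] - prop[("noise", gender)]
             for gender in ("male", "female")}

for key, value in prop.items():
    print(key, round(value, 2))
print(gender_gap)
print(noise_gap)
```

Unequal gaps across the levels of the other variable are what the table suggests descriptively; they do not by themselves establish a significant interaction.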

Tables such as Table 16.36, as well as graphs (such as Figures 27.1 and 27.2 employed to 
illustrate an interaction for a between-subjects factorial analysis of variance), can be extremely 
useful in providing a researcher with visual information regarding whether or not an interaction 
is present in a set of data. It should be emphasized that in order to definitively establish the
presence of an interaction, it is required that the appropriate inferential statistical test be
conducted, and that the latter test yields a significant result for the interaction in question.

As noted earlier, the analysis of multidimensional contingency tables is a complex subject, 
and the discussion of it in this section has been limited in nature. Among those sources that 
discuss the subject in greater detail are Christensen (1990), Everitt (1977, 1992), Fienberg 
(1980), Marascuilo and McSweeney (1977), Marascuilo and Serlin (1988), Wickens (1989), and 
Zar (1999). 


VIII. Additional Examples Illustrating the Chi-Square Test for 
r x c Tables 


Examples 16.8-16.11 are additional examples that can be evaluated with the chi-square test for
r x c tables.


Example 16.8 A researcher conducts a study to evaluate the relative problem-solving ability 
of male versus female adolescents. One hundred males and 80 females are randomly selected 
from a population of adolescents. Each subject is given a mechanical puzzle to solve. The
dependent variable is whether or not a person is able to solve the puzzle. Sixty out of the 100
male subjects are able to solve the puzzle, while only 30 out of the 80 female subjects are able
to solve the puzzle. Is there a significant difference between males and females with respect to
their ability to solve the puzzle?


Table 16.37 summarizes the data for Example 16.8. Example 16.8 conforms to the require- 
ments of the chi-square test for homogeneity. This is the case, since there are two independent 
samples/groups (males versus females) which are dichotomized with respect to the following two 
categories on the dimension of problem-solving ability: solved puzzle versus did not solve 
puzzle. The grouping of subjects on the basis of gender represents a nonmanipulated independent 
variable, while the problem-solving performance of subjects represents the dependent variable. 
Note that the number of people for each gender is predetermined by the experimenter. Also note 
that it is not necessary to have an equal number of observations in the categories of the row 
variable, which represents the independent variable. Since the independent variable is 
nonmanipulated, if the chi-square analysis is significant it will only allow the researcher to 
conclude that a significant association exists between gender and one's ability to solve the puzzle. 
The researcher cannot conclude that gender is the direct cause of any observed differences in 
problem-solving ability between males and females. Employing Equation 16.2, the obtained 
chi-square value for Table 16.37 is $\chi^2 = 9$, which for df = 1 is greater than $\chi^2_{.05} = 3.84$ and
$\chi^2_{.01} = 6.63$. Thus, the null hypothesis can be rejected at both the .05 and .01 levels. Inspection
of Table 16.37 reveals that a larger proportion of males are able to solve the puzzle than females. 
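
The value of 9 can be verified with a short computation on the 2 x 2 table. This is a minimal sketch rather than the text's own procedure: each expected frequency is taken as (row sum)(column sum)/n, the two-dimensional counterpart of the expected frequencies used earlier, and the degrees of freedom are assumed to be (r - 1)(c - 1), which agrees with the df = 1 quoted above.

```python
# Rows: males, females. Columns: solved puzzle, did not solve puzzle.
table = [
    [60, 40],
    [30, 50],
]
row_sums = [sum(row) for row in table]
col_sums = [sum(col) for col in zip(*table)]
n = sum(row_sums)

chi2 = 0.0
for i, row in enumerate(table):
    for j, o in enumerate(row):
        e = row_sums[i] * col_sums[j] / n   # expected frequency of cell ij
        chi2 += (o - e) ** 2 / e

df = (len(table) - 1) * (len(table[0]) - 1)
proportion_solved = [row[0] / total for row, total in zip(table, row_sums)]

print(chi2, df, proportion_solved)
```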


Table 16.37 Summary of Data for Example 16.8

               Solved puzzle     Did not solve puzzle     Row sums
Males                60                   40                 100
Females              30                   50                  80
Column sums          90                   90                 Total observations 180


Example 16.9 A pollster conducts a survey to evaluate whether Caucasians and African- 
Americans differ in their attitude toward gun control. Five hundred people are randomly 
selected from a telephone directory and called at 8 P.M. in the evening. An interview is 
conducted with each individual, at which time a person is categorized with respect to both race 
and whether one supports or opposes gun control. Table 16.38 summarizes the results of the 
survey. Is there evidence of racial differences with respect to attitude toward gun control? 


This example conforms to the requirements of the chi-square test of independence, since 
a single sample is categorized on two dimensions. The two dimensions subjects are categorized 
with respect to are race, for which there are the two categories Caucasian versus African- 
American, and attitude toward gun control, for which there are the two categories supports gun 
control versus opposes gun control. Note that neither the number of Caucasians or African-
Americans (i.e., the sums of the rows), nor the number of people who support or oppose gun
control (i.e., the sums of the columns) is predetermined prior to the pollster conducting the
survey. The pollster selects a single sample of 500 subjects and categorizes them on both
dimensions after the data are collected. The obtained chi-square value for Table 16.38 is
$\chi^2 = 59.91$, which for df = 1 is greater than $\chi^2_{.05} = 3.84$ and $\chi^2_{.01} = 6.63$. Thus, the null
hypothesis can be rejected at both the .05 and .01 levels. Inspection of Table 16.38 reveals that
a larger proportion of Caucasians opposes gun control, whereas a larger proportion of African-
Americans supports gun control.
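
Both the chi-square value and the proportions referred to above can be reproduced from the four cell counts. The sketch below is illustrative; the nested dictionary layout and variable names are assumptions, and because the expected frequencies are not rounded, the statistic may differ from 59.91 in the second decimal place.

```python
counts = {
    "Caucasians": {"supports": 120, "opposes": 170},
    "African-Americans": {"supports": 160, "opposes": 50},
}

row_sums = {race: sum(cols.values()) for race, cols in counts.items()}
col_sums = {}
for cols in counts.values():
    for attitude, o in cols.items():
        col_sums[attitude] = col_sums.get(attitude, 0) + o
n = sum(row_sums.values())

# Proportion of each group that supports gun control.
proportion_supporting = {race: counts[race]["supports"] / row_sums[race]
                         for race in counts}

# Chi-square with expected frequency (row sum)(column sum)/n for each cell.
chi2 = 0.0
for race, cols in counts.items():
    for attitude, o in cols.items():
        e = row_sums[race] * col_sums[attitude] / n
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2), proportion_supporting)
```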


Table 16.38 Summary of Data for Example 16.9

                       Supports        Opposes
                       gun control     gun control     Row sums
Caucasians                 120             170            290
African-Americans          160              50            210
Column sums                280             220            Total observations 500


Example 16.10 A researcher conducts a study on a college campus to examine the relationship 
between a student's class standing and the number of times a student visits a physician during 
the school year. Table 16.39 summarizes the responses of a random sample of 280 students 
employed in the study. Do the data indicate that the number of visits a student makes to a 
physician is independent of his or her class standing? 


Table 16.39 Summary of Data for Example 16.10

               0 visits     1-5 visits     More than 5 visits     Row sums
Freshman                                                             60
Sophomore                                                            50
Junior                                                               90
Senior                                                               80
Column sums      119            67                 94                Total observations 280


This example conforms to the requirements of the chi-square test of independence, since a 
single sample is categorized on two dimensions. The two dimensions subjects are categorized with 
respect to are class standing, for which there are the four categories: Freshman, Sophomore, 
Junior, Senior, and the number of visits to a physician, for which there are the three categories: 
0 visits, 1-5 visits, more than 5 visits. Note that neither the sums of the rows nor the columns is
predetermined by the researcher. The researcher randomly selects 280 subjects and categorizes
each subject on both dimensions after the data are collected. Since the data for Example 16.10 are
identical to those employed in Example 16.5, the analysis yields the same result. The null hypothesis can be
rejected, since the obtained value $\chi^2 = 59.16$ is significant at both the .05 and .01 levels. Thus,
the researcher can conclude that a student's class standing and the number of visits one makes to 
a physician are not independent of one another (i.e., the two dimensions seem to be associated/ 
correlated with one another). As is the case with Example 16.5, a more detailed analysis of the data 
can be conducted through use of the comparison procedures described in Section VI. 


Example 16.11 A researcher conducts a study to evaluate whether or not a nationally 
acclaimed astrologer is able to match subjects with their correct sun sign. The astrologer and 
researcher agree to employ a format in which the astrologer views a five-minute videotape of a
subject who verbally responds to five open-ended questions of a personal nature. Upon viewing
the videotape, the astrologer indicates which of the 12 sun signs he believes the person's birth
date falls within. Over a three-month period the astrologer views videotapes for 718 subjects.
Table 16.40 summarizes the results of the study. Since each column of the table corresponds to
a subject's actual sun sign, and each row the sun sign selected by the astrologer, all correct 
responses by the astrologer appear in the 12 cells that constitute the diagonal of the table (i.e., 
any cell in which the row and column labels are identical). Does the performance of the 
astrologer indicate that he can reliably identify a person's sun sign? 


Table 16.40 Astrology Study Data 


                                                Actual Sun Sign 

Selected                                                                                                 Sagit-   Capri-      Row 
sun sign     Aquarius   Pisces    Aries   Taurus   Gemini   Cancer      Leo    Virgo    Libra  Scorpio   tarius     corn     sums 
Aquarius            6        2        4        8        5        3        9        7        2        8        6        4       64 
Pisces              3        2        0        5        7        1        9        9        4        2        8        7       57 
Aries               0        9        3        1        7        6        2        1        9        8        9        4       59 
Taurus              9        3        4        3        4        8        1        8        5        6        2        6       59 
Gemini              8        8        2        6        3        9        4        6        4        1        3        9       63 
Cancer              6        6        4        7        7        7        2        4        6        3        8        1       61 
Leo                 7        7        1        4        9        9        3        9        2        1        4        6       62 
Virgo               2        4        6        8        8        7        1        0        3        5        1        9       54 
Libra               8        8        8        8        3        6        2        9        0        6        2        4       64 
Scorpio             5        1        9        0        0        5        7        9        9        0        6        6       57 
Sagittarius         0        4        7        7        8        3        6        1        9        2        9        0       56 
Capricorn           7        0        5        9        5        2        0        7        9        8        6        4       62 
Column sums        61       54       53       66       66       66       46       70       62       50       64       60      718 


Table 16.40 summarizes the results of the study in the format of a 12 x 12 contingency 
table. Since the table is based on a single sample of subjects who are categorized two times (i.e., 
each subject is categorized by the astrologer with regard to sun sign and categorized on the basis 
of one's actual birth date), the chi-square test of independence can be employed to evaluate the 
body of data contained within the whole table. However, the latter analysis will really not 
address the question of primary interest with any degree of precision. The simplest and most 
straightforward analysis will be to determine whether the number of correct responses by the 
astrologer is significantly above chance expectation. Thus, instead of evaluating the data with 
the chi-square test of independence, we will employ the binomial sign test for a single- 
sample, which will be used to evaluate the number of correct responses in the diagonal of the 
table. 

Equation 9.7 will be employed to evaluate the data.³⁶ If the astrologer is just guessing a 
person's sun sign, he has a one in 12 chance of being correct, and an 11 in 12 chance of being 
incorrect.³⁷ Thus, the expected number of correct responses will be the total number of responses 
(which corresponds to the total number of subjects/observations) multiplied by 1/12. Employing 
Equation 9.1, the expected number of correct responses is computed to be 
$\mu = n\pi_1 = (718)(1/12) = 59.83$. Equation 9.2 is employed to compute the standard deviation of 
the binomially distributed variable: $\sigma = \sqrt{n\pi_1\pi_2} = \sqrt{(718)(1/12)(11/12)} = 7.41$. 
Since the sum of the 12 values in the diagonal of Table 16.40 equals 40, the value x = 40 is 
employed to represent the number of correct responses in Equation 9.7. Substituting the 
appropriate values in the latter equation, the value z = -2.68 is computed. The negative z value 
is consistent with the fact that the number of correct responses is below chance expectation. 
Certainly the latter, in and of itself, invalidates the astrologer's claim that he can reliably 
identify a person's sun sign. 


\[ z = \frac{x - n\pi_1}{\sqrt{n\pi_1\pi_2}} = \frac{40 - (718)(1/12)}{\sqrt{(718)(1/12)(11/12)}} = \frac{40 - 59.83}{7.41} = -2.68 \]


The obtained value z = -2.68 is evaluated with Table A1 in the Appendix. In Table A1 
the tabled critical two-tailed .05 and .01 values are $z_{.05} = 1.96$ and $z_{.01} = 2.58$, and the tabled 
critical one-tailed .05 and .01 values are $z_{.05} = 1.65$ and $z_{.01} = 2.33$. Since the absolute value 
z = 2.68 is larger than all of the aforementioned critical values, both the nondirectional alternative 
hypothesis and the directional alternative hypothesis that is consistent with the data are 
supported. Put simply, the researcher will conclude that the astrologer's performance is 
significantly below chance. 
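
The computation above can be retraced with a short Python sketch. It assumes the reading of Table 16.40 in which rows hold the sign selected by the astrologer and columns hold the actual sign (one garbled cell in the Pisces row is taken to be 7, the only value consistent with that row's sum of 57). The names `TABLE_16_40` and `binomial_z` are illustrative and do not come from the text. The last lines also check the point made in Endnote 36, that the chi-square goodness-of-fit value for the two categories (correct versus incorrect) equals the square of z.

```python
import math

# Table 16.40: rows = sign selected by the astrologer, columns = actual sign.
TABLE_16_40 = [
    [6, 2, 4, 8, 5, 3, 9, 7, 2, 8, 6, 4],  # Aquarius
    [3, 2, 0, 5, 7, 1, 9, 9, 4, 2, 8, 7],  # Pisces
    [0, 9, 3, 1, 7, 6, 2, 1, 9, 8, 9, 4],  # Aries
    [9, 3, 4, 3, 4, 8, 1, 8, 5, 6, 2, 6],  # Taurus
    [8, 8, 2, 6, 3, 9, 4, 6, 4, 1, 3, 9],  # Gemini
    [6, 6, 4, 7, 7, 7, 2, 4, 6, 3, 8, 1],  # Cancer
    [7, 7, 1, 4, 9, 9, 3, 9, 2, 1, 4, 6],  # Leo
    [2, 4, 6, 8, 8, 7, 1, 0, 3, 5, 1, 9],  # Virgo
    [8, 8, 8, 8, 3, 6, 2, 9, 0, 6, 2, 4],  # Libra
    [5, 1, 9, 0, 0, 5, 7, 9, 9, 0, 6, 6],  # Scorpio
    [0, 4, 7, 7, 8, 3, 6, 1, 9, 2, 9, 0],  # Sagittarius
    [7, 0, 5, 9, 5, 2, 0, 7, 9, 8, 6, 4],  # Capricorn
]


def binomial_z(x, n, pi1):
    """Normal-approximation z for x successes in n trials (success probability pi1)."""
    mu = n * pi1                             # expected number of successes
    sigma = math.sqrt(n * pi1 * (1 - pi1))   # binomial standard deviation
    return (x - mu) / sigma


n = sum(sum(row) for row in TABLE_16_40)         # total number of subjects
x = sum(TABLE_16_40[i][i] for i in range(12))    # correct responses (diagonal)
z = binomial_z(x, n, 1 / 12)

# Chi-square goodness-of-fit on the two categories (correct, incorrect).
expected_correct = n / 12
expected_incorrect = n - expected_correct
chi_square = ((x - expected_correct) ** 2 / expected_correct
              + ((n - x) - expected_incorrect) ** 2 / expected_incorrect)

print(n, x, round(z, 2), round(chi_square, 2))
```
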

Given the fact that Table 16.40 is a 12 x 12 contingency table, there are numerous other 
analyses that can be conducted on the data. For instance, the researcher can examine the 
accuracy of the astrologer's responses within each of the sun signs. As an example, with respect 
to the sun sign Sagittarius, the astrologer correctly identified 9 of the 64 subjects who are, in fact, 
a Sagittarius. Since the chance probability of a correct response within a given sun sign is also 
1/12, it turns out that a score of 9 is significantly above chance (if a one-tailed analysis is con- 
ducted). On the other hand, the astrologer's scores of 0 for Virgo, Libra, and Scorpio are all 
significantly below chance. The point to be made here is that if one sifts through a large body 
of data, just by chance some results will be significant, and some of the significant results will 
be in the direction toward which a researcher is biased. Thus, if there is reason to believe that 
a specific element of the data will be significant, it should be specified beforehand. If the latter 
then turns out to be significant, that is quite different from finding the same significant difference 
after the fact. 
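
As a rough illustration of this kind of after-the-fact sifting, the sketch below applies the same normal approximation to each sun sign separately. It uses the number of subjects who actually belong to each sign and the number of those the astrologer identified correctly (the diagonal of Table 16.40). The variable names are illustrative, and the cutoff 1.65 is the tabled one-tailed .05 value quoted earlier. With expected counts this small the approximation is crude, so the output should be read as a sketch of the reasoning rather than a definitive analysis.

```python
import math

SIGNS = ["Aquarius", "Pisces", "Aries", "Taurus", "Gemini", "Cancer",
         "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn"]
ACTUAL_TOTALS = [61, 54, 53, 66, 66, 66, 46, 70, 62, 50, 64, 60]  # subjects per actual sign
CORRECT = [6, 2, 3, 3, 3, 7, 3, 0, 0, 0, 9, 4]                    # diagonal of Table 16.40


def sign_z(correct, total, pi1=1 / 12):
    """z value for the number of correct identifications within one sun sign."""
    mu = total * pi1
    sigma = math.sqrt(total * pi1 * (1 - pi1))
    return (correct - mu) / sigma


z_by_sign = {s: sign_z(c, t) for s, c, t in zip(SIGNS, CORRECT, ACTUAL_TOTALS)}

# Signs whose z reaches the tabled one-tailed .05 value in either direction.
flagged = {s: z for s, z in z_by_sign.items() if abs(z) >= 1.65}

for sign, z in flagged.items():
    direction = "above" if z > 0 else "below"
    print(f"{sign}: z = {z:.2f} ({direction} chance)")
```

Twelve separate tests are run here on one body of data, which is exactly the situation the paragraph above warns about.
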
Endnotes 


1. A general discussion of the chi-square distribution can be found under the single-sample 
chi-square test for a population variance. 


2. The use of the chi-square approximation (which employs a continuous probability 
distribution to approximate a discrete probability distribution) is based on the fact that the 
computation of exact probabilities requires an excessive amount of calculations. 


3. In the case of both the chi-square test of independence and the chi-square test of homog- 
eneity, the same result will be obtained regardless of which of the variables is designated 
as the row variable versus the column variable. 


4. It is just coincidental that the number of introverts equals the number of extroverts. 


5. In the context of the discussion of the chi-square test of independence, the proportion of 
observations in Cell$_{11}$ refers to the number of observations in Cell$_{11}$ divided by the total 
number of observations in the 2 x 2 table. In the discussion of the hypothesis for the chi- 
square test for homogeneity, the proportion of observations in Cell$_{11}$ refers to the number 
of observations in Cell$_{11}$ divided by the total number of observations in Row 1 (i.e., the row 
in which Cell$_{11}$ appears). 


6. Equation 16.2 is an extension of Equation 8.2 (which is employed to compute the value of 
chi-square for the chi-square goodness-of-fit test) to a two-dimensional table. In Equation 
16.2, the use of the two summation expressions $\Sigma\Sigma$ indicates that the operations 
summarized in Table 16.4 are applied to all of the cells in the r x c table. In contrast, the 
single summation expression $\Sigma$ in Equation 8.2 indicates that the operations summarized 
in Table 8.2 are applied to all k cells in a one-dimensional table. 


7. The same chi-square value will be obtained if the row and column variables are reversed 
(i.e., the helping variable represents the row variable and the noise variable represents 
the column variable). 


8.  Correlational studies are discussed in detail under the Pearson product-moment corre- 
lation coefficient. 


9. The value $\chi^2_{.98} = 5.43$ is determined by interpolation. It can also be derived by squaring 
the tabled critical one-tailed .01 value $z_{.01} = 2.33$, since the square of the latter value is 
equivalent to the chi-square value at the 98th percentile. The use of z values in reference 
to a 2 x 2 contingency table is discussed later in this section under the z test for two 
independent proportions. 


© 2000 by Chapman & Hall/CRC 


10. Within the framework of this discussion, the value $\chi^2_{.05} = 3.84$ represents the tabled chi- 
square value at the 95th percentile (which demarcates the extreme 5% in the right tail of 
the chi-square distribution). Thus, using the format employed for the one-tailed .05 and .01 
values, the notation identifying the two-tailed .05 value can be written as $\chi^2_{.95} = 3.84$. In 
the same respect, the value $\chi^2_{.01} = 6.63$ represents the tabled chi-square value at the 99th 
percentile (which demarcates the extreme 1% in the right tail of the chi-square distribution). 
Thus, using the format employed for the one-tailed .05 and .01 values, the notation 
identifying the two-tailed .01 value can be written as $\chi^2_{.99} = 6.63$. 


11. The null and alternative hypotheses presented for the Fisher exact test in this section are 
equivalent to the alternative form for stating the null and alternative hypotheses for the chi- 
square test of homogeneity presented in Section III (if the hypotheses in Section III are 
applied to a 2 x 2 contingency table). 


12. Sourcebooks documenting statistical tables (e.g., Owen (1962) and Beyer (1968)), as well 
as many books that specialize in nonparametric statistics (e.g., Daniel (1990); Marascuilo 
and McSweeney (1977); Siegel and Castellan (1988)) contain tables of the hypergeometric 
distribution that can be employed with 2 x 2 contingency tables. Such tables eliminate the 
requirement of employing Equations 16.7/16.8 to compute the value of P. 


13. The value (1 - p), which is often represented by the notation q, can also be computed as 
follows: $q = (1 - p) = (b + d)/(n_1 + n_2) = (b + d)/n$. The value q is a pooled estimate 
of the proportion of observations in Column 2 in the underlying population. 


14. Due to rounding off error there may be a minimal discrepancy between the square of a z 
value and the corresponding chi-square value. 


15. The logic for employing Equation 16.11 in lieu of Equation 16.9 is the same as that 
discussed in reference to the t test for two independent samples, when in the case of the 
latter test the null hypothesis stipulates a value other than zero for the difference between 
the population means (and Equation 11.5 is employed to compute the test statistic in lieu 
of Equations 11.1/11.2/11.3). 


16. The denominator of Equation 16.11 is employed to compute $s_{p_1 - p_2}$ instead of the denom- 
inator of Equation 16.9, since in computing a confidence interval it cannot be assumed that 
$\pi_1 = \pi_2$ (which is assumed in Equation 16.9, and serves as the basis for computing a 
pooled p value in the latter equation). 


17. The median test for independent samples can also be employed within the framework of 
the model for the chi-square test of independence. To illustrate this, assume that Example 
16.4 is modified so that the researcher randomly selects a sample of 200 subjects, and does 
not specify beforehand that the sample is comprised of 100 females and 100 males. If it 
just happens by chance that the sample is comprised of 100 females and 100 males, one can 
state that neither the sum of the rows nor the sum of the columns is predetermined by the 
researcher. As noted in Section I, when neither of the marginal sums is predetermined, the 
design conforms to the model for the chi-square test of independence. 


18. The word column can be interchanged with the word row in the definition of a complex 
comparison. 




19. Another consideration that should be mentioned with respect to conducting comparisons 
is that two or more comparisons for a set of data can be orthogonal (which means they are 
independent of one another), or comparisons can overlap with respect to the information 
they provide. As a general rule, when a limited number of comparisons is planned, it is 
most efficient to conduct orthogonal comparisons. The general subject of orthogonal com- 
parisons is discussed in greater detail in Section VI of the single-factor between-subjects 
analysis of variance. 


20. The null and alternative hypotheses stated below do not apply to the odds ratio. 


21. Some sources note that the phi coefficient can only assume a range of values between 0 
and +1. In these sources, the term |ad - bc| is employed in the numerator of Equation 
16.17. By employing the absolute value of the term in the numerator of Equation 16.17, 
the value of phi will always be a positive number. Under the latter condition the following 
will be true: $\phi = \sqrt{\chi^2/n}$. 


22. In the case of small sample sizes, the results of the Fisher exact test are employed as the 
criterion for determining whether the computed value of phi is significant. 


23. In such a case, the data are summarized in the form of a 2 x 2 contingency table docu- 
menting the proportion of subjects who answer in each of the categories of two 
dichotomous variables (e.g., True versus False for both variables/test items). 


24. The reason why the result of the chi-square test for r x c tables is not employed to assess 
the significance of Q is because Q is not a function of chi-square. It should be noted that 
since Q is a special case of Goodman and Kruskal's gamma, it can be argued that 
Equation 32.2 (the significance test for gamma) can be employed to assess whether or not 
Q is significant. However, Ott et al. (1992) state that a different procedure is employed for 
evaluating the significance of Q versus gamma. Equation 32.2 will not yield the same result 
as that obtained with Equation 16.21 when it is applied to a 2 x 2 table. If the gamma 
statistic is computed for Examples 16.1/16.2 it yields the absolute value $\gamma = .56$ ($\gamma$ is the 
lower case Greek letter gamma), which is identical to the value of Q computed for the same 
set of data. (The absolute value is employed since the contingency table is not ordered, and 
thus, depending upon how the cells are arranged, a value of either +.56 or -.56 can be 
derived for gamma.) However, when Equation 32.2 is employed to assess the significance 
of $\gamma = .56$, it yields the absolute value z = 3.51, which although significant at both the .05 and 
.01 levels is lower than the absolute value z = 5.46 obtained with Equation 16.21. 


25. An excellent discussion of odds can be found in Christensen (1990). 


26. a) One can also divide .439 by 1.5 and obtain the value o = .29. The latter value indicates 
that the odds of helping in the noise condition are .29 times as large as the odds of helping 
in the no noise condition. In the case of the data in Table 16.23, the value o = .29 indicates 
that the odds of contracting the disease in the washes hands condition are .29 times as 
large as the odds of contracting the disease in the does not wash condition; b) As noted 
earlier, when odds are employed what is often stated are the odds that an event will not 
occur. Using this definition, the odds that a person in the noise condition did not help the 
confederate (or that someone who washes her hands does not contract the disease) are 
2.33:1 (since (70/100)/(30/100) = 2.33). The odds that a person in the no noise condition 
did not help the confederate (or that someone who does not wash her hands does not 
contract the disease) are .667:1 or 2:3 (since (40/100)/(60/100) = .667). These values 
yield the same odds ratio, since 2.33/.667 = 3.49. 


27. Equation 16.31 is an alternate equation for computing the odds ratio. 

\[ o = \frac{p_b \, p_c}{p_a \, p_d} \qquad \text{(Equation 16.31)} \]

From Tables 16.2/16.23, we can determine that $p_a = a/n = 30/200 = .15$, $p_b = b/n 
= 70/200 = .35$, $p_c = c/n = 60/200 = .3$, and $p_d = d/n = 40/200 = .2$. Employing Equa- 
tion 16.31 with the data for Example 16.1, the value o = 3.5 is computed. 

\[ o = \frac{(.35)(.3)}{(.15)(.2)} = 3.5 \]


28. Pagano and Gauvreau (1993) note that if the expected frequencies for any of the cells in the 
contingency table are less than 5, the equation below should be employed to compute the 
standard error. 


29. Yates' correction for continuity was not used to compute the values $\chi^2 = 11.58$ and 
$\chi^2 = 2.09$ for the two hospitals. If Yates' correction is used, the computed chi-square 
values will be a little lower. 


30. Some sources (e.g., Christensen (1990)) employ the term factors (which is the term that is 
commonly employed within the framework of a factorial analysis of variance) to identify 
the different independent variables. 


31. Zar (1999) notes that the degrees of freedom are the sum of the degrees of freedom for 
all of the interactions. Specifically $df = rcl - r - c - l + 2 = (r - 1)(c - 1)(l - 1) + 
(r - 1)(c - 1) + (r - 1)(l - 1) + (c - 1)(l - 1)$. 


32. Zar (1999) notes that the degrees of freedom are the sum of the following: $df = rcl - cl 
- r + 1 = (r - 1)(c - 1)(l - 1) + (r - 1)(c - 1) + (r - 1)(l - 1)$. 


33. Based on Endnote 32, it logically follows that the degrees of freedom are the sum of the 
following: $df = rcl - rl - c + 1 = (r - 1)(c - 1)(l - 1) + (c - 1)(r - 1) + (c - 1)(l - 1)$. 


34. Based on Endnote 32, it logically follows that the degrees of freedom are the sum of the 
following: $df = rcl - rc - l + 1 = (r - 1)(c - 1)(l - 1) + (l - 1)(r - 1) + (l - 1)(c - 1)$. 


35. Zar (1999) discusses and cites references on mosaic displays for contingency tables, which 
represent an alternative to conventional graphs for visually summarizing the data in a con- 
tingency table. 


36. Since Equations 9.6 and 9.7 are equivalent, either one can be employed for the analysis. In 
addition, since our analysis involves a binary situation (i.e., the two response categories of 
being correct or incorrect), the data can also be evaluated with the chi-square goodness- 
of-fit test. The chi-square value obtained with the latter test will be equal to the square of 
the z value obtained with Equation 9.6/9.7. 


37. In actuality, the number of days in each sun sign is not one-twelfth of the total number of 
days in a year (since 365 cannot be divided evenly by 12, it logically follows that the 
number of days in each sun sign will not be equal). In spite of the latter, for all practical 
purposes the value 1/12 can be accurately employed to represent the probability for each 
sun sign. 
Inferential Statistical Tests Employed 
with Two Dependent Samples 
(and Related Measures of 
Association/Correlation) 


Test 17: The t Test for Two Dependent Samples 
Test 18: The Wilcoxon Matched-Pairs Signed-Ranks Test 
Test 19: The Binomial Sign Test for Two Dependent Samples 


Test 20: The McNemar Test 
Test 17 


The t Test for Two Dependent Samples 
(Parametric Test Employed with Interval/Ratio Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Do two dependent samples represent two populations with 
different mean values? 


Relevant background information on test The t test for two dependent samples, which is 
employed in a hypothesis testing situation involving two dependent samples, is one of a number 
of inferential statistical tests that are based on the t distribution (which is described in detail under 
the single-sample t test (Test 2)). Throughout the discussion of the t test for two dependent 
samples, the term experimental conditions will also be employed to represent the dependent 
samples employed in a study. In a dependent samples design, each subject either serves in all 
of the k (where $k \geq 2$) experimental conditions, or else is matched with a subject in each of the 
other (k - 1) experimental conditions (matching is discussed in Section VII).¹ In designs that are 
evaluated with the t test for two dependent samples, the value of k will always equal 2. 

In conducting the t test for two dependent samples, the means of the two experimental 
conditions (represented by the notations $\bar{X}_1$ and $\bar{X}_2$) are employed to estimate the values of the 
means of the populations ($\mu_1$ and $\mu_2$) the conditions represent. If the result of the t test for two 
dependent samples is significant, it indicates the researcher can conclude there is a high like- 
lihood that the two experimental conditions represent populations with different mean values. 
It should be noted that the t test for two dependent samples is the appropriate test to employ 
for contrasting the means of two dependent samples when the values of the underlying popula- 
tion variances are unknown. In instances where the latter two values are known, the appropriate 
test to employ is the z test for two dependent samples (Test 17e), which is described in 
Section VI. 

The t test for two dependent samples is employed with interval/ratio data, and is based 
on the following assumptions: a) The sample of n subjects has been randomly selected from the 
population it represents; b) The distribution of data in the underlying populations each of the 
experimental conditions represents is normal; and c) The third assumption, which is referred to 
as the homogeneity of variance assumption, states that the variance of the underlying population 
represented by Condition 1 is equal to the variance of the underlying population represented by 
Condition 2 (i.e., $\sigma_1^2 = \sigma_2^2$). It should be noted that the t test for two dependent samples is 
more sensitive to violation of the homogeneity of variance assumption (which is discussed in 
Section VI) than is the t test for two independent samples (Test 11). If any of the 
aforementioned assumptions of the t test for two dependent samples are saliently violated, the 
reliability of the test statistic may be compromised. 

When a study employs a dependent samples design, the following two issues related to 
experimental control must be taken into account: a) In a dependent samples design in which each 
subject serves in both experimental conditions, it is essential that the experimenter controls for 
order effects (also known as sequencing or carryover effects). An order effect occurs when an 
obtained difference on the dependent variable is a direct result of the order of presentation of the 
experimental conditions, rather than being due to the independent variable manipulated by the 
experimenter. Order effects can be controlled through the use of a technique called counter- 
balancing, which is discussed in Section VII; and b) When a dependent samples design employs 
matched subjects, within each pair of matched subjects each of the two subjects must be 
randomly assigned to one of the two experimental conditions. Nonrandom assignment of 
subjects to the experimental conditions can compromise the internal validity of a study.² A 
more thorough discussion of matching can be found in Section VII. 
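
The random assignment described in point b) can be sketched in a few lines of Python. This is only a minimal illustration: the function name `assign_matched_pairs`, the subject labels, and the use of a coin-flip per pair are assumptions made for the example, not a procedure prescribed by the text.

```python
import random


def assign_matched_pairs(pairs, rng=None):
    """Randomly assign the two members of each matched pair to the two conditions.

    pairs: list of (subject_a, subject_b) tuples, one tuple per matched pair.
    rng:   optional random.Random instance (pass a seeded one for reproducibility).
    """
    rng = rng or random.Random()
    assignments = []
    for first, second in pairs:
        # A fair coin flip decides which member of the pair serves in Condition 1.
        if rng.random() < 0.5:
            first, second = second, first
        assignments.append({"Condition 1": first, "Condition 2": second})
    return assignments


# Hypothetical subject labels for three matched pairs.
example_pairs = [("A1", "A2"), ("B1", "B2"), ("C1", "C2")]
example_assignments = assign_matched_pairs(example_pairs, random.Random(0))
for row in example_assignments:
    print(row)
```
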


II. Example 


Example 17.1 A psychologist conducts a study to determine whether or not people exhibit more 
emotionality when they are exposed to sexually explicit words than when they are exposed to 
neutral words. Each of ten subjects is shown a list of 16 randomly arranged words, which are 
projected onto a screen one at a time for a period of five seconds. Eight of the words on the list 
are sexually explicit and eight of the words are neutral. As each word is projected on the screen, 
a subject is instructed to say the word softly to him or herself. As a subject does this, sensors 
attached to the palms of the subject's hands record galvanic skin response (GSR), which is used 
by the psychologist as a measure of emotionality. The psychologist computes two scores for each 
subject, one score for each of the experimental conditions: Condition 1: GSR/Explicit — The 
average GSR score for the eight sexually explicit words; Condition 2: GSR/Neutral — The 
average GSR score for the eight neutral words. The GSR/Explicit and the GSR/Neutral scores 
of the ten subjects follow. (The higher the score, the higher the level of emotionality.) Subject 
1 (9, 8); Subject 2 (2, 2); Subject 3 (1, 3); Subject 4 (4, 2); Subject 5 (6, 3); Subject 6 (4, 0); 
Subject 7 (7, 4); Subject 8 (8, 5); Subject 9 (5, 4); Subject 10 (1, 0).³ Do subjects exhibit 
differences in emotionality with respect to the two categories of words? 


III. Null versus Alternative Hypotheses 


Null hypothesis $H_0$: $\mu_1 = \mu_2$ 


(The mean of the population Condition 1 represents equals the mean of the population Condition 
2 represents.) 


Alternative hypothesis $H_1$: $\mu_1 \neq \mu_2$ 


(The mean of the population Condition 1 represents does not equal the mean of the population 
Condition 2 represents. This is a nondirectional alternative hypothesis and it is evaluated with 
a two-tailed test. In order to be supported, the absolute value of t must be equal to or greater 
than the tabled critical two-tailed t value at the prespecified level of significance. Thus, either 
a significant positive t value or a significant negative t value will provide support for this 
alternative hypothesis.) 


or 


$H_1$: $\mu_1 > \mu_2$ 

(The mean of the population Condition 1 represents is greater than the mean of the population 
Condition 2 represents. This is a directional alternative hypothesis and it is evaluated with a 
one-tailed test. It will only be supported if the sign of t is positive, and the absolute value of t 
is equal to or greater than the tabled critical one-tailed t value at the prespecified level of 
significance.) 


or 


$H_1$: $\mu_1 < \mu_2$ 

(The mean of the population Condition 1 represents is less than the mean of the population 
Condition 2 represents. This is a directional alternative hypothesis and it is evaluated with a 
one-tailed test. It will only be supported if the sign of t is negative, and the absolute value of 
t is equal to or greater than the tabled critical one-tailed t value at the prespecified level of sig- 
nificance.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis ($H_0$) is rejected.⁴ 


IV. Test Computations 


Two methods can be employed to compute the test statistic for the t test for two dependent 
samples. The method to be described in this section, which is referred to as the direct- 
difference method, allows for the quickest computation of the t statistic. In Section VI, a 
computationally equivalent but more tedious method for computing t is described. 


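Table 17.1 Data for Example 17.1 


             Condition 1    Condition 2 
Subject        $X_1$          $X_2$          $D$        $D^2$ 
   1             9              8             1           1 
   2             2              2             0           0 
   3             1              3            -2           4 
   4             4              2             2           4 
   5             6              3             3           9 
   6             4              0             4          16 
   7             7              4             3           9 
   8             8              5             3           9 
   9             5              4             1           1 
  10             1              0             1           1 

$\Sigma X_1 = 47$     $\Sigma X_2 = 31$     $\Sigma D = 16$     $\Sigma D^2 = 54$ 
$\Sigma D+ = 18$     $\Sigma D- = -2$ 
$\bar{X}_1 = \Sigma X_1 / n = 47/10 = 4.7$     $\bar{X}_2 = \Sigma X_2 / n = 31/10 = 3.1$ 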


The data for Example 17.1 and the preliminary computations for the direct-difference 
method are summarized in Table 17.1. Note that there are n = 10 subjects, and that there is a 
total of 2n = (2)(10) = 20 scores, since each subject has two scores. The two scores of the 10 
subjects are listed in the columns of Table 17.1 labelled $X_1$ and $X_2$. The score of a subject in the 
column labelled $X_1$ is the average GSR score of the subject for the eight sexually explicit words 
(Condition 1), while the score of a subject in the column labelled $X_2$ is the average GSR score 
of the subject for the eight neutral words (Condition 2). Column 4 of Table 17.1 lists a difference 
score for each subject (designated by the notation D), which is computed by subtracting a 
subject's $X_2$ score from his $X_1$ score (i.e., $D = X_1 - X_2$). Column 5 of the table lists a $D^2$ 
score for each subject, which is obtained by squaring a subject's D score. 

In Column 4 of Table 17.1, the summary value $\Sigma D = 16$ is obtained by adding $\Sigma D+ = 18$, 
the sum of the positive difference scores (i.e., all those difference scores with a + sign), and 
$\Sigma D- = -2$, the sum of the negative difference scores (i.e., all those difference scores with a - sign). 
The reader should take note of the fact that whenever $\Sigma X_1 > \Sigma X_2$ (and consequently 
$\bar{X}_1 > \bar{X}_2$), the value $\Sigma D$ will be a positive number, whereas whenever $\Sigma X_1 < \Sigma X_2$ (and con- 
sequently $\bar{X}_1 < \bar{X}_2$), the value $\Sigma D$ will be a negative number. 
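
The preliminary computations for the direct-difference method can be checked with a few lines of Python. The scores are those listed for Example 17.1, and the variable names are illustrative only.

```python
# GSR scores for the ten subjects in Example 17.1.
X1 = [9, 2, 1, 4, 6, 4, 7, 8, 5, 1]   # Condition 1: GSR/Explicit
X2 = [8, 2, 3, 2, 3, 0, 4, 5, 4, 0]   # Condition 2: GSR/Neutral

D = [x1 - x2 for x1, x2 in zip(X1, X2)]   # difference scores, D = X1 - X2
D_squared = [d * d for d in D]            # squared difference scores

sum_positive = sum(d for d in D if d > 0)   # sum of the positive difference scores
sum_negative = sum(d for d in D if d < 0)   # sum of the negative difference scores
sum_D = sum(D)                              # equals sum_positive + sum_negative
sum_D_squared = sum(D_squared)

print(D, sum_positive, sum_negative, sum_D, sum_D_squared)
```
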

Equation 17.1 is the direct-difference equation for computing the test statistic for the t test 
for two dependent samples. 

\[ t = \frac{\bar{D}}{s_{\bar{D}}} \qquad \text{(Equation 17.1)} \]

Where: $\bar{D}$ represents the mean of the difference scores 
       $s_{\bar{D}}$ represents the standard error of the mean difference 

The mean of the difference scores is computed with Equation 17.2. 

\[ \bar{D} = \frac{\Sigma D}{n} \qquad \text{(Equation 17.2)} \]

Employing Equation 17.2, the value $\bar{D} = 1.6$ is computed. 

\[ \bar{D} = \frac{16}{10} = 1.6 \]

Equation 17.3 is employed to compute $\tilde{s}_D$, which represents the estimated population 
standard deviation of the difference scores.⁵ 

\[ \tilde{s}_D = \sqrt{\frac{\Sigma D^2 - \frac{(\Sigma D)^2}{n}}{n - 1}} \qquad \text{(Equation 17.3)} \]

Employing Equation 17.3, the value $\tilde{s}_D = 1.78$ is computed. 

\[ \tilde{s}_D = \sqrt{\frac{54 - \frac{(16)^2}{10}}{10 - 1}} = 1.78 \]

Equation 17.4 is employed to compute the value $s_{\bar{D}}$. The value $s_{\bar{D}}$ represents the stand- 
ard error of the mean difference, which is an estimated population standard deviation of mean 
difference scores.⁶ 

\[ s_{\bar{D}} = \frac{\tilde{s}_D}{\sqrt{n}} \qquad \text{(Equation 17.4)} \]

Employing Equation 17.4, the value $s_{\bar{D}} = .56$ is computed. 

\[ s_{\bar{D}} = \frac{1.78}{\sqrt{10}} = .56 \]

Substituting $\bar{D} = 1.6$ and $s_{\bar{D}} = .56$ in Equation 17.1, the value t = 2.86 is computed.⁷ 

\[ t = \frac{1.6}{.56} = 2.86 \]

The reader should take note of the fact that the values $\tilde{s}_D$ and $s_{\bar{D}}$, both of which are 
estimates of either a population standard deviation or the standard deviation of a sampling distri- 
bution, can never be a negative number. If a negative value is obtained for either of the afore- 
mentioned values, it indicates a computational error has been made. 
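
The direct-difference computation can be sketched in Python as follows. The function name `direct_difference_t` is illustrative, and the standard deviation is assumed to be the usual unbiased estimate with n - 1 in the denominator. Note that the unrounded result is t = 2.85 (approximately); the value 2.86 reported above arises because the standard error is rounded to .56 before the division.

```python
import math


def direct_difference_t(x1, x2):
    """Return (mean difference, est. SD of differences, standard error, t)."""
    n = len(x1)
    d = [a - b for a, b in zip(x1, x2)]
    mean_d = sum(d) / n
    # Estimated population standard deviation of the difference scores.
    s_d = math.sqrt((sum(v * v for v in d) - sum(d) ** 2 / n) / (n - 1))
    # Standard error of the mean difference.
    se_mean_d = s_d / math.sqrt(n)
    return mean_d, s_d, se_mean_d, mean_d / se_mean_d


X1 = [9, 2, 1, 4, 6, 4, 7, 8, 5, 1]
X2 = [8, 2, 3, 2, 3, 0, 4, 5, 4, 0]
mean_d, s_d, se_mean_d, t = direct_difference_t(X1, X2)
df = len(X1) - 1

print(round(mean_d, 2), round(s_d, 2), round(se_mean_d, 2), round(t, 2), df)
```
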


V. Interpretation of the Test Results 


The obtained value t = 2.86 is evaluated with Table A2 (Table of Student's t Distribution) in 
the Appendix. The degrees of freedom for the t test for two dependent samples are computed 
with Equation 17.5. 


\[ df = n - 1 \qquad \text{(Equation 17.5)} \]

Employing Equation 17.5, the value df = 10 - 1 = 9 is computed. The tabled critical two- 
tailed and one-tailed .05 and .01 t values for df = 9 are summarized in Table 17.2. (For a review 
of the protocol for employing Table A2, the reader should review Section V of the single- 
sample t test.) 


Table 17.2 Tabled Critical .05 and .01 t Values for df = 9 


                      $t_{.05}$    $t_{.01}$ 
Two-tailed values       2.26         3.25 
One-tailed values       1.83         2.82 


The following guidelines are employed in evaluating the null hypothesis for the t test for 
two dependent samples. 

a) If the nondirectional alternative hypothesis $H_1$: $\mu_1 \neq \mu_2$ is employed, the null hypothe- 
sis can be rejected if the obtained absolute value of t is equal to or greater than the tabled critical 
two-tailed value at the prespecified level of significance. 

b) If the directional alternative hypothesis $H_1$: $\mu_1 > \mu_2$ is employed, the null hypothesis 
can be rejected if the sign of t is positive, and the value of t is equal to or greater than the tabled 
critical one-tailed value at the prespecified level of significance. 

c) If the directional alternative hypothesis $H_1$: $\mu_1 < \mu_2$ is employed, the null hypothesis 
can be rejected if the sign of t is negative, and the absolute value of t is equal to or greater than 
the tabled critical one-tailed value at the prespecified level of significance. 

Employing the above guidelines, the following conclusions can be reached. 

The nondirectional alternative hypothesis $H_1$: $\mu_1 \neq \mu_2$ is supported at the .05 level, since 
the computed value t = 2.86 is greater than the tabled critical two-tailed value $t_{.05} = 2.26$. The 
latter alternative hypothesis, however, is not supported at the .01 level, since t = 2.86 is less than 
the tabled critical two-tailed value $t_{.01} = 3.25$. 

The directional alternative hypothesis $H_1$: $\mu_1 > \mu_2$ is supported at both the .05 and .01 
levels, since the obtained value t = 2.86 is a positive number that is greater than the tabled critical 
one-tailed values $t_{.05} = 1.83$ and $t_{.01} = 2.82$. Note that when the directional alternative hypothe- 
sis $H_1$: $\mu_1 > \mu_2$ is supported, it is required that $\bar{X}_1 > \bar{X}_2$. 

The directional alternative hypothesis $H_1$: $\mu_1 < \mu_2$ is not supported, since the ob- 
tained value t = 2.86 is a positive number. In order for the directional alternative hypothesis 
$H_1$: $\mu_1 < \mu_2$ to be supported, the computed value of t must be a negative number (as well as 
the fact that the absolute value of t must be equal to or greater than the tabled critical one-tailed 
value at the prespecified level of significance). In order for the data to be consistent with the 
directional alternative hypothesis $H_1$: $\mu_1 < \mu_2$, it is required that $\bar{X}_1 < \bar{X}_2$. 

A summary of the analysis of Example 17.1 with the t test for two dependent samples
follows: It can be concluded that the average GSR (emotionality) score for the sexually explicit
words is significantly higher than the average GSR score for the neutral words. This result can
be summarized as follows (if $\alpha = .05$ is employed): t(9) = 2.86, p < .05.
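The decision guidelines a) through c) can be sketched in a few lines of Python. This is only an illustration: the function name `reject_null` and the labels "two-sided", "greater", and "less" are assumptions introduced here, and the critical values are the tabled df = 9 values quoted in the text.

```python
def reject_null(t, alternative, crit_two_tailed, crit_one_tailed):
    """Apply guidelines a)-c) to an obtained t value.

    alternative is one of the illustrative labels:
      "two-sided" (H1: mu1 != mu2), "greater" (H1: mu1 > mu2), "less" (H1: mu1 < mu2).
    """
    if alternative == "two-sided":
        return abs(t) >= crit_two_tailed
    if alternative == "greater":
        return t > 0 and t >= crit_one_tailed
    if alternative == "less":
        return t < 0 and abs(t) >= crit_one_tailed
    raise ValueError(f"unknown alternative: {alternative!r}")


# Tabled critical values for df = 9, as quoted in the text.
CRITICAL = {0.05: {"two": 2.26, "one": 1.83}, 0.01: {"two": 3.25, "one": 2.82}}
t_obtained = 2.86

# One decision per (alternative, alpha) combination.
decisions = {
    (alt, alpha): reject_null(t_obtained, alt, c["two"], c["one"])
    for alpha, c in CRITICAL.items()
    for alt in ("two-sided", "greater", "less")
}

for key, rejected in sorted(decisions.items()):
    print(key, "reject H0" if rejected else "retain H0")
```

The resulting decisions match the conclusions stated for Example 17.1: the nondirectional hypothesis is supported only at the .05 level, and the directional hypothesis in the predicted direction is supported at both levels.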


VI. Additional Analytical Procedures for the t Test for Two 
Dependent Samples and/or Related Tests 


1. Alternative equation for the t test for two dependent samples
Equation 17.6 is an alternative equation that can be employed to compute the test statistic for
the t test for two dependent samples.

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s^2_{\bar{X}_1} + s^2_{\bar{X}_2} - 2(r_{X_1X_2})(s_{\bar{X}_1})(s_{\bar{X}_2})}}$$  (Equation 17.6)








The computation of t with Equation 17.6 requires more computations than does Equation
17.1 (the direct difference method equation). Equation 17.6, unlike Equation 17.1, requires that
the estimated population variance be computed for each of the samples (in Section VI it is noted
that the latter values are required in order to evaluate the homogeneity of variance assumption
of the t test for two dependent samples). Since a total understanding of Equation 17.6 requires
an understanding of the concept of correlation, the reader may find it useful to read Section I of
the Pearson product-moment correlation coefficient (Test 28) prior to continuing this section.

Except for the last term in the denominator of Equation 17.6 (i.e.,
$2(r_{X_1X_2})(s_{\bar{X}_1})(s_{\bar{X}_2})$), the latter equation is identical to Equation 11.2
(the equation for the t test for two independent samples when $n_1 = n_2$). The value
$r_{X_1X_2}$ represents the coefficient of correlation between the two scores of subjects (or
matched pairs of subjects) on the dependent variable. It is expected that as a result of using the
same subjects in both conditions (or by employing matched subjects), a positive correlation will
exist between pairs of scores (i.e., scores that are in the same row of Table 17.1). The closer the
value of $r_{X_1X_2}$ is to +1, the stronger the association between the scores of subjects on the
dependent variable. (When r = +1, subjects who have a high score in Condition 1 will have a
comparably high score in Condition 2, and subjects who have a low score in Condition 1 will have
a comparably low score in Condition 2.) As the value of $r_{X_1X_2}$ approaches +1, the value of
the denominator of Equation 17.6 decreases, which will result in an increase in the absolute value
computed for t. Note that if $r_{X_1X_2} = 0$, the denominator of Equation 17.6 becomes identical
to the denominator of Equation 11.2. Thus, if the scores of the n subjects under the two
experimental conditions are not correlated with one another, the equation for the t test for two
dependent samples (as represented by Equation 17.6) reduces to Equation 11.2.

The intent of the above discussion is to illustrate that one advantage of employing a design
that can be evaluated with the t test for two dependent samples, as opposed to a design that is
evaluated with the t test for two independent samples, is that if there is a positive correlation
between pairs of scores, the former test will provide a more powerful test of an alternative
hypothesis than will the latter test. The greater power associated with the t test for two
dependent samples is a direct result of the lower value that will be computed for the denominator
of Equation 17.6 when contrasted with the denominator that will be computed for Equation 11.2
for the same set of data. In the case of both equations, the denominator is an estimated measure
of variability in a sampling distribution. By employing pairs of scores that are positively
correlated with one another, the estimated variability in the sampling distribution will be less
than will be the case if the scores are not correlated with one another.
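The effect of the correlation on the denominator of Equation 17.6 can be seen numerically. The following sketch holds the two squared standard errors fixed at the values reported for Example 17.1 (.79 and .56) and varies r; the function name `paired_denominator` and the list of r values are illustrative choices, not part of the original presentation.

```python
import math


def paired_denominator(var_mean_1, var_mean_2, r):
    """Denominator of Equation 17.6: sqrt(s2_1 + s2_2 - 2 * r * s_1 * s_2),
    where s2_1 and s2_2 are the squared standard errors of the two means."""
    s1, s2 = math.sqrt(var_mean_1), math.sqrt(var_mean_2)
    return math.sqrt(var_mean_1 + var_mean_2 - 2 * r * s1 * s2)


var_mean_1, var_mean_2 = 0.79, 0.56   # squared standard errors, Example 17.1
mean_difference = 1.6                 # numerator of Equation 17.6, Example 17.1

for r in (0.0, 0.25, 0.5, 0.78):
    denom = paired_denominator(var_mean_1, var_mean_2, r)
    print(f"r = {r:4.2f}  denominator = {denom:.3f}  t = {mean_difference / denom:.2f}")
```

At r = 0 the denominator equals that of Equation 11.2, and it shrinks steadily as r grows, which is the source of the power advantage described above.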

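The computation of t with Equation 17.6 will now be illustrated. Note that in order to
compute t with the latter equation, the following values are required: $\bar{X}_1$, $\bar{X}_2$,
$s_{\bar{X}_1}$, $s_{\bar{X}_2}$, $s^2_{\bar{X}_1}$, $s^2_{\bar{X}_2}$, and $r_{X_1X_2}$. In order
to compute the estimated population variances and standard deviations, the values
$\Sigma X_1^2$ and $\Sigma X_2^2$ are required. The latter values are computed in Table 17.3.
Employing the summary information provided in Table 17.1 and the values $\Sigma X_1^2 = 293$
and $\Sigma X_2^2 = 147$, all of the above noted values, with the exception of $r_{X_1X_2}$, are
computed.

$$\bar{X}_1 = 4.7 \qquad \bar{X}_2 = 3.1 \qquad \tilde{s}_1 = 2.83 \qquad \tilde{s}_2 = 2.38$$

$$s_{\bar{X}_1} = \frac{2.83}{\sqrt{10}} = .89 \qquad s_{\bar{X}_2} = \frac{2.38}{\sqrt{10}} = .75$$

$$s^2_{\bar{X}_1} = (.89)^2 = .79 \qquad s^2_{\bar{X}_2} = (.75)^2 = .56$$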


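Equation 17.7 is employed to compute the value $r_{X_1X_2}$.

$$r_{X_1X_2} = \frac{\Sigma X_1X_2 - \dfrac{(\Sigma X_1)(\Sigma X_2)}{n}}{\sqrt{\left[\Sigma X_1^2 - \dfrac{(\Sigma X_1)^2}{n}\right]\left[\Sigma X_2^2 - \dfrac{(\Sigma X_2)^2}{n}\right]}}$$  (Equation 17.7)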

















The only value in Equation 17.7 that is required to compute $r_{X_1X_2}$ which has not been
computed for Example 17.1 is the term $\Sigma X_1X_2$ in the numerator. The latter value, which is
computed in Table 17.3, is obtained as follows: Each subject's $X_1$ score is multiplied by the
subject's $X_2$ score. The resulting score represents an $X_1X_2$ score for the subject. The n
$X_1X_2$ scores are summed, and the resulting value represents the term $\Sigma X_1X_2$ in
Equation 17.7.

Employing Equation 17.7, the value $r_{X_1X_2} = .78$ is computed.

$$r_{X_1X_2} = \frac{193 - \dfrac{(47)(31)}{10}}{\sqrt{\left[293 - \dfrac{(47)^2}{10}\right]\left[147 - \dfrac{(31)^2}{10}\right]}} = .78$$

Table 17.3 Computation of $\Sigma X_1^2$, $\Sigma X_2^2$, and $\Sigma X_1X_2$ for Example 17.1

                 Condition 1            Condition 2
Subject      X_1       X_1^2        X_2       X_2^2       X_1 X_2
   1          9          81          8          64           72
   2          2           4          2           4            4
   3          1           1          3           9            3
   4          4          16          2           4            8
   5          6          36          3           9           18
   6          4          16          0           0            0
   7          7          49          4          16           28
   8          8          64          5          25           40
   9          5          25          4          16           20
  10          1           1          0           0            0
Sum          47         293         31         147          193

$\Sigma X_1 = 47$, $\Sigma X_1^2 = 293$, $\Sigma X_2 = 31$, $\Sigma X_2^2 = 147$, $\Sigma X_1X_2 = 193$
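The column sums of Table 17.3 and the correlation given by Equation 17.7 can be checked with a short script. The variable names below are illustrative; the scores are the ten pairs listed in the table.

```python
import math

x1 = [9, 2, 1, 4, 6, 4, 7, 8, 5, 1]   # Condition 1 scores (Table 17.3)
x2 = [8, 2, 3, 2, 3, 0, 4, 5, 4, 0]   # Condition 2 scores (Table 17.3)
n = len(x1)

# Column sums of Table 17.3
sum_x1, sum_x2 = sum(x1), sum(x2)
sum_x1_sq = sum(v * v for v in x1)
sum_x2_sq = sum(v * v for v in x2)
sum_x1x2 = sum(a * b for a, b in zip(x1, x2))

# Equation 17.7
numerator = sum_x1x2 - sum_x1 * sum_x2 / n
denominator = math.sqrt(
    (sum_x1_sq - sum_x1 ** 2 / n) * (sum_x2_sq - sum_x2 ** 2 / n)
)
r = numerator / denominator

print(sum_x1, sum_x1_sq, sum_x2, sum_x2_sq, sum_x1x2)
print(round(r, 2))
```

The script reproduces the sums 47, 293, 31, 147, and 193, and a correlation that rounds to .78.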


When the relevant values are substituted in Equation 17.6, the value t = 2.86 is computed
(which is the same value computed for t with Equation 17.1). Note that the value of the
numerator of Equation 17.6 is $\bar{X}_1 - \bar{X}_2 = \bar{D} = 1.6$.

$$t = \frac{4.7 - 3.1}{\sqrt{.79 + .56 - 2(.78)(.89)(.75)}} = 2.86$$
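As a check that Equation 17.6 and the direct difference method (Equation 17.1) agree, the sketch below computes both from the raw scores of Example 17.1 using only the Python standard library. Because no intermediate values are rounded, both routes give approximately 2.85 rather than the 2.86 obtained in the text from two-decimal intermediate values. Variable names are illustrative.

```python
import math
from statistics import mean, stdev, variance

x1 = [9, 2, 1, 4, 6, 4, 7, 8, 5, 1]   # Condition 1 scores
x2 = [8, 2, 3, 2, 3, 0, 4, 5, 4, 0]   # Condition 2 scores
n = len(x1)

# Direct difference method (Equation 17.1): t = mean(D) / (s_D / sqrt(n))
d = [a - b for a, b in zip(x1, x2)]
t_direct = mean(d) / (stdev(d) / math.sqrt(n))

# Equation 17.6: uses the squared standard errors and the correlation
var_mean_1 = variance(x1) / n
var_mean_2 = variance(x2) / n
cov = sum((a - mean(x1)) * (b - mean(x2)) for a, b in zip(x1, x2)) / (n - 1)
r = cov / (stdev(x1) * stdev(x2))
t_eq_17_6 = (mean(x1) - mean(x2)) / math.sqrt(
    var_mean_1 + var_mean_2 - 2 * r * math.sqrt(var_mean_1) * math.sqrt(var_mean_2)
)

print(round(t_direct, 3), round(t_eq_17_6, 3))
```

The two values coincide because the variance of the difference scores equals the sum of the two variances minus twice the covariance.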


In order to illustrate that the t test for two dependent samples provides a more powerful
test of an alternative hypothesis than the t test for two independent samples, the data for
Example 17.1 will be evaluated with Equation 11.2 (which is the equation for the latter test). In
employing Equation 11.2, the positive correlation that exists between the scores of subjects under
the two experimental conditions will not be taken into account. Use of Equation 11.2 with the
data for Example 17.1 assumes that in lieu of having n = 10 subjects, there are instead two
independent groups, each group being composed of 10 subjects. Thus, $n_1 = 10$, and the 10 $X_1$
scores in Table 17.1 represent the scores of the 10 subjects in Group 1, and $n_2 = 10$, and the 10
$X_2$ scores in Table 17.1 represent the scores of the 10 subjects in Group 2. Since the values
$\bar{X}_1$, $\bar{X}_2$, $s^2_{\bar{X}_1}$, and $s^2_{\bar{X}_2}$ have already been computed,
they can be substituted in Equation 11.2. When the relevant values are substituted in Equation
11.2, the value t = 1.37 is computed.


$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s^2_{\bar{X}_1} + s^2_{\bar{X}_2}}} = \frac{4.7 - 3.1}{\sqrt{.79 + .56}} = 1.37$$


Employing Equation 11.4, the degrees of freedom for Equation 11.2 are df = 10 + 10 - 2
= 18. In Table A2, the tabled critical two-tailed .05 and .01 values for df = 18 are $t_{.05} = 2.10$
and $t_{.01} = 2.88$, and the tabled critical one-tailed .05 and .01 values are $t_{.05} = 1.73$ and
$t_{.01} = 2.55$. Since the obtained value t = 1.37 is less than all of the aforementioned critical
values, the null hypothesis cannot be rejected, regardless of whether a nondirectional or
directional alternative hypothesis is employed. Recall that when the same set of data is
evaluated with Equations 17.1/17.6, the nondirectional alternative hypothesis
$H_1: \mu_1 \neq \mu_2$ is supported at the .05 level and the directional alternative hypothesis
$H_1: \mu_1 > \mu_2$ is supported at both the .05 and .01 levels.
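The comparison can be reproduced by treating the two columns of scores as if they came from independent groups. The sketch below implements the equal-n form described for Equation 11.2 (difference between means divided by the square root of the sum of the squared standard errors); variable names are illustrative.

```python
import math
from statistics import mean, variance

x1 = [9, 2, 1, 4, 6, 4, 7, 8, 5, 1]   # treated here as independent Group 1
x2 = [8, 2, 3, 2, 3, 0, 4, 5, 4, 0]   # treated here as independent Group 2
n1, n2 = len(x1), len(x2)

# Equation 11.2 (equal n): the correlation between pairs is ignored
t_independent = (mean(x1) - mean(x2)) / math.sqrt(
    variance(x1) / n1 + variance(x2) / n2
)
df_independent = n1 + n2 - 2          # Equation 11.4

# Smallest tabled critical value quoted in the text for df = 18 (one-tailed, .05)
smallest_critical = 1.73
print(round(t_independent, 2), df_independent, t_independent >= smallest_critical)
```

The t value of about 1.37 falls short of even the smallest critical value for df = 18, whereas the dependent samples analysis of the same scores yields a significant result.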

Note that by employing twice the degrees of freedom, the tabled critical values employed
for Equation 11.2 will always be smaller than those employed for Equations 17.1/17.6. However,
if there is a reasonably high positive correlation between the pairs of scores, the lower critical
values associated with the t test for two independent samples will be offset by the fact that the
t value computed with Equation 11.2 will be substantially smaller than the value computed with
Equations 17.1/17.6.

When the t test for two dependent samples is employed to evaluate a dependent samples
design, it is assumed that a positive correlation exists between the scores of subjects in the
two experimental conditions. It is, however, theoretically possible (although unlikely) that
the two scores of subjects will be negatively correlated. If, in fact, the correlation between the
$X_1$ and $X_2$ scores of subjects is negative, the value of the denominator of Equation 17.6 will
actually be larger than will be the case if Equation 11.2 is employed to evaluate the same set of
data (since in the denominator of Equation 17.6, if $r_{X_1X_2}$ is a negative number, the product
$2(r_{X_1X_2})(s_{\bar{X}_1})(s_{\bar{X}_2})$ will be added to instead of subtracted from
$s^2_{\bar{X}_1} + s^2_{\bar{X}_2}$). In addition, if the correlation between the two scores is a
very low positive value that is close to 0, the slight increment in the value of t computed with
Equation 17.6 may be offset by the loss of degrees of freedom (and the consequent increase in the
tabled critical t value), so as to allow Equation 11.2 (the t test for two independent samples) to
provide a more powerful test of an alternative hypothesis than Equation 17.6 (the t test for two
dependent samples).

In the unlikely event that in a dependent samples design there is a substantial negative
correlation between subjects' scores in the two experimental conditions, it is very unlikely that
evaluation of the data with Equation 17.6 will yield a significant result. However, the presence
of a significant negative correlation, in and of itself, can certainly be of statistical importance.
When a negative correlation is present, subjects who obtain a high score in one experimental
condition will obtain a low score in the other experimental condition, and vice versa. The closer
the negative correlation is to -1, the more pronounced the tendency for a subject's scores in the
two conditions to be in the opposite direction. To illustrate the presence of a negative correlation,
consider the following example. Assume that employing a dependent samples design, each of five
subjects who serve in two experimental conditions obtains the following scores: Subject 1 (1, 5);
Subject 2 (2, 4); Subject 3 (3, 3); Subject 4 (4, 2); Subject 5 (5, 1). In this hypothetical example,
it turns out that the correlation between the five pairs of scores is $r_{X_1X_2} = -1$, which is the
strongest possible negative correlation. Since, however, the mean and median value for both
of the experimental conditions is equal to 3, evaluation of the data with the t test for two
dependent samples, as well as the more commonly employed nonparametric procedures
employed for a design involving two dependent samples (such as the Wilcoxon matched-pairs
signed-ranks test (Test 18) and the binomial sign test for two dependent samples (Test 19)),
will lead one to conclude there is no difference between the scores of subjects under the two
conditions. This is the case, since such tests base the comparison of conditions on an actual or
implied measure of central tendency. If it is assumed that in the above example the sample data
accurately reflect what is true with respect to the underlying populations, it appears that the higher
a subject's score in Condition 1, the lower the subject's score in Condition 2, and vice versa.
In such a case, it can be argued that if the coefficient of correlation is statistically significant,
that in itself can indicate the presence of a significant treatment effect (albeit an unusual one),
in that there is a significant association between the two sets of scores. The determination of
whether a correlation coefficient is significant is discussed under the Pearson product-moment
correlation coefficient, as well as in the discussion of a number of the other correlational
procedures discussed in the book.
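The five-subject example can be verified directly. The following sketch computes the correlation, the two means and medians, and the mean difference for the pairs listed above; the variable names are illustrative.

```python
from statistics import mean, median, stdev

cond1 = [1, 2, 3, 4, 5]   # Condition 1 scores of Subjects 1-5
cond2 = [5, 4, 3, 2, 1]   # Condition 2 scores of Subjects 1-5
n = len(cond1)

# Pearson correlation from the sample covariance and standard deviations
cov = sum((a - mean(cond1)) * (b - mean(cond2)) for a, b in zip(cond1, cond2)) / (n - 1)
r = cov / (stdev(cond1) * stdev(cond2))

mean_difference = mean(cond1) - mean(cond2)

print(round(r, 6), mean(cond1), mean(cond2), median(cond1), median(cond2), mean_difference)
```

The correlation is -1 while the mean difference is exactly 0, so the numerator of the t test for two dependent samples is zero and no difference between conditions can be detected by it.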


2. The equation for the t test for two dependent samples when a value for a difference other
than zero is stated in the null hypothesis
If in stating the null hypothesis for the t test for two dependent samples a researcher stipulates
that the difference between $\mu_1$ and $\mu_2$ is some value other than zero, Equation 17.8 is
employed to evaluate the null hypothesis in lieu of Equation 17.6.

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{s^2_{\bar{X}_1} + s^2_{\bar{X}_2} - 2(r_{X_1X_2})(s_{\bar{X}_1})(s_{\bar{X}_2})}}$$  (Equation 17.8)


When the null hypothesis is $H_0: \mu_1 = \mu_2$ (which as noted previously can also be written
as $H_0: \mu_1 - \mu_2 = 0$), the value of $(\mu_1 - \mu_2)$ in Equation 17.8 reduces to zero, and
thus what remains of the numerator in Equation 17.8 is $(\bar{X}_1 - \bar{X}_2)$ (which represents
the numerator of Equation 17.6). In evaluating the value of t computed with Equation 17.8, the
same protocol is employed that is described for evaluating a t value for the t test for two
independent samples when the difference stated in the null hypothesis for the latter test is some
value other than zero (in which case Equation 11.5 is employed to compute t).
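A minimal sketch of Equation 17.8 follows. The function names `pearson_r` and `paired_t` and the parameter `null_difference` are assumptions made for illustration, and the null difference of 1 used in the second call is a hypothetical value, not one taken from the text.

```python
import math
from statistics import mean, stdev


def pearson_r(x1, x2):
    """Pearson correlation from the sample covariance and standard deviations."""
    m1, m2 = mean(x1), mean(x2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (len(x1) - 1)
    return cov / (stdev(x1) * stdev(x2))


def paired_t(x1, x2, null_difference=0.0):
    """Equation 17.8; with null_difference = 0 it reduces to Equation 17.6."""
    n = len(x1)
    se1 = stdev(x1) / math.sqrt(n)   # standard error of the mean, Condition 1
    se2 = stdev(x2) / math.sqrt(n)   # standard error of the mean, Condition 2
    r = pearson_r(x1, x2)
    denominator = math.sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)
    return ((mean(x1) - mean(x2)) - null_difference) / denominator


x1 = [9, 2, 1, 4, 6, 4, 7, 8, 5, 1]
x2 = [8, 2, 3, 2, 3, 0, 4, 5, 4, 0]

t_zero_null = paired_t(x1, x2)                      # H0: mu1 - mu2 = 0
t_shifted_null = paired_t(x1, x2, null_difference=1.0)  # hypothetical H0: mu1 - mu2 = 1
print(round(t_zero_null, 2), round(t_shifted_null, 2))
```

Only the numerator changes when a nonzero difference is stipulated; the denominator, and therefore the degrees of freedom and critical values, are unaffected.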


3. Test 17a: The t test for homogeneity of variance for two dependent samples: Evaluation
of the homogeneity of variance assumption of the t test for two dependent samples
Prior to reading this section, the reader should review the discussion of homogeneity of variance
in Section VI of the t test for two independent samples. As is the case with an independent
samples design, in a dependent samples design the homogeneity of variance assumption evaluates
whether there is evidence to indicate that an inequality exists between the variances of the
populations represented by the two experimental conditions. The null and alternative hypotheses
employed in evaluating the homogeneity of variance assumption are as follows:


Null hypothesis    $H_0: \sigma_1^2 = \sigma_2^2$

(The variance of the population Condition 1 represents equals the variance of the population
Condition 2 represents.)

Alternative hypothesis    $H_1: \sigma_1^2 \neq \sigma_2^2$

(The variance of the population Condition 1 represents does not equal the variance of the
population Condition 2 represents. This is a nondirectional alternative hypothesis and it is
evaluated with a two-tailed test. In evaluating the homogeneity of variance assumption, a
nondirectional alternative hypothesis is always employed.)


The test that will be described in this section for evaluating the homogeneity of variance
assumption is referred to as the t test for homogeneity of variance for two dependent samples.
The reader should take note of the fact that the $F_{max}$ test/F test for two population variances
(Test 11a) (employed in evaluating the homogeneity of variance assumption for the t test for two
independent samples) is not appropriate to use with a dependent samples design, since it does
not take into account the correlation between subjects' scores in the two experimental conditions
(i.e., $r_{X_1X_2}$). Equation 17.9 is the equation for the t test for homogeneity of variance for
two dependent samples.


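$$t = \frac{(\tilde{s}_L^2 - \tilde{s}_S^2)\sqrt{n - 2}}{\sqrt{4\,\tilde{s}_L^2\,\tilde{s}_S^2\,(1 - r^2_{X_1X_2})}}$$  (Equation 17.9)

Where: $\tilde{s}_L^2$ is the larger of the two estimated population variances
       $\tilde{s}_S^2$ is the smaller of the two estimated population variances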


Since for Example 17.1 it has already been determined that $\tilde{s}_1 = 2.83$ and
$\tilde{s}_2 = 2.38$, by squaring the latter values we can determine the values of the estimated
population variances. Thus: $\tilde{s}_1^2 = (2.83)^2 = 8.01 = \tilde{s}_L^2$ and
$\tilde{s}_2^2 = (2.38)^2 = 5.66 = \tilde{s}_S^2$. Substituting the appropriate values in Equation
17.9, the value t = .79 is computed.


$$t = \frac{(8.01 - 5.66)\sqrt{10 - 2}}{\sqrt{4(8.01)(5.66)[1 - (.78)^2]}} = .79$$


The degrees of freedom to employ for evaluating the t value computed with Equation 17.9 
are computed with Equation 17.10. 


df=n-2 (Equation 17.10) 


Employing Equation 17.10, the degrees of freedom for the analysis are df = 10 - 2 = 8.
For df = 8, the tabled critical two-tailed .05 and .01 values in Table A2 are $t_{.05} = 2.31$ and
$t_{.01} = 3.35$. In order to reject the null hypothesis the obtained value of t must be equal to or
greater than the tabled critical value at the prespecified level of significance. Since the value t
= .79 is less than both of the aforementioned critical values, the null hypothesis cannot be
rejected. Thus, the homogeneity of variance assumption is not violated.
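The test can be sketched from the raw scores as follows. The function name `variance_homogeneity_t` is an assumption introduced for illustration; it implements Equation 17.9 with df = n - 2 (Equation 17.10), and the critical value 2.31 is the tabled two-tailed .05 value quoted above.

```python
import math
from statistics import mean, stdev, variance


def variance_homogeneity_t(x1, x2):
    """Equation 17.9: t test for homogeneity of variance for two dependent samples.

    Returns the t value and its degrees of freedom (n - 2).
    """
    n = len(x1)
    v1, v2 = variance(x1), variance(x2)            # estimated population variances
    larger, smaller = max(v1, v2), min(v1, v2)
    m1, m2 = mean(x1), mean(x2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
    r = cov / (stdev(x1) * stdev(x2))              # correlation between paired scores
    t = (larger - smaller) * math.sqrt(n - 2) / math.sqrt(
        4 * larger * smaller * (1 - r ** 2)
    )
    return t, n - 2


x1 = [9, 2, 1, 4, 6, 4, 7, 8, 5, 1]
x2 = [8, 2, 3, 2, 3, 0, 4, 5, 4, 0]

t_var, df_var = variance_homogeneity_t(x1, x2)
critical_two_tailed_05 = 2.31                      # tabled value for df = 8
print(round(t_var, 2), df_var, t_var >= critical_two_tailed_05)
```

Since the obtained value of about .79 is below the critical value, the sketch reaches the same conclusion as the text: the homogeneity of variance assumption is not rejected.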

There are a number of additional points that should be noted with respect to the t test for
homogeneity of variance for two dependent samples.

a) Unless $\tilde{s}_L^2 = \tilde{s}_S^2$ (in which case t = 0), Equation 17.9 will always yield a
positive t value. This is the case, since in the numerator of the equation the smaller variance is
subtracted from the larger variance.

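b) In some sources Equation 17.9 is written in the form of Equation 17.11, which employs
the notation $\tilde{s}_1^2$ and $\tilde{s}_2^2$ in place of $\tilde{s}_L^2$ and $\tilde{s}_S^2$.

$$t = \frac{(\tilde{s}_1^2 - \tilde{s}_2^2)\sqrt{n - 2}}{\sqrt{4\,\tilde{s}_1^2\,\tilde{s}_2^2\,(1 - r^2_{X_1X_2})}}$$  (Equation 17.11)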





If Equation 17.11 is employed, the computed value of t can be a negative number.
Specifically, t will be negative when $\tilde{s}_2^2 > \tilde{s}_1^2$. In point of fact, the sign of t
is irrelevant, unless one is evaluating a directional alternative hypothesis. Since the homogeneity
of variance assumption involves evaluation of a nondirectional alternative hypothesis, when
Equation 17.11 is employed, the researcher is only interested in the absolute value of t.

c) As is the case for a test of homogeneity of variance for two independent samples, it is
possible to use the t test for homogeneity of variance for two dependent samples to evaluate
a directional alternative hypothesis regarding the relationship between the variances of two
populations. Thus, if a researcher specifically predicts that the variance of the population
represented by Condition 1 is larger than the variance of the population represented by Condition 2
(i.e., $H_1: \sigma_1^2 > \sigma_2^2$), or that the variance of the population represented by
Condition 1 is smaller than the variance of the population represented by Condition 2 (i.e.,
$H_1: \sigma_1^2 < \sigma_2^2$), the latter pair of directional alternative hypotheses can be
evaluated with Equation 17.11. In such a case, the sign of the computed t value is relevant. If one
employs Equation 17.11 to evaluate a directional alternative hypothesis, the following guidelines
are employed in evaluating the null hypothesis:

1) If the directional alternative hypothesis $H_1: \sigma_1^2 > \sigma_2^2$ is employed, the null
hypothesis can be rejected if the sign of t is positive, and the value of t is equal to or
greater than the tabled critical one-tailed value at the prespecified level of significance.

2) If the directional alternative hypothesis $H_1: \sigma_1^2 < \sigma_2^2$ is employed, the null
hypothesis can be rejected if the sign of t is negative, and the absolute value of t is equal to or
greater than the tabled critical one-tailed value at the prespecified level of significance.

If the directional alternative hypothesis $H_1: \sigma_1^2 > \sigma_2^2$ is evaluated for Example
17.1, the tabled critical one-tailed .05 and .01 values employed for the analysis are
$t_{.05} = 1.86$ and $t_{.01} = 2.90$ (which respectively correspond to the tabled values at the 95th
and 99th percentiles). Although the data are consistent with the directional alternative hypothesis
$H_1: \sigma_1^2 > \sigma_2^2$, the null hypothesis cannot be rejected, since the obtained value
t = .79 is less than the aforementioned one-tailed critical values. (t = .79 is obtained with
Equation 17.11, since as previously noted, $\tilde{s}_L^2 = \tilde{s}_1^2 = 8.01$ and
$\tilde{s}_S^2 = \tilde{s}_2^2 = 5.66$.)

d) Equation 17.12 is an alternative but equivalent form of Equation 17.11.

$$t = \frac{(F - 1)\sqrt{n - 2}}{2\sqrt{F(1 - r^2_{X_1X_2})}}$$  (Equation 17.12)


In Equation 17.12, the value of F is computed with Equation 11.8
($F = \tilde{s}_1^2 / \tilde{s}_2^2$, which is described under the t test for two independent
samples). Substituting the appropriate values in Equation 17.12, the value t = .79 is computed.

$$F = \frac{8.01}{5.66} = 1.42 \qquad t = \frac{(1.42 - 1)\sqrt{10 - 2}}{2\sqrt{1.42[1 - (.78)^2]}} = .79$$


Note that Equation 17.12 can only yield a positive t value.

e) Equation 17.13 is an alternative but equivalent form of Equation 17.9 that can only be
employed to evaluate a nondirectional alternative hypothesis. Since
$\tilde{s}_L^2 = \tilde{s}_1^2 = 8.01$ and $\tilde{s}_S^2 = \tilde{s}_2^2 = 5.66$, Equation 17.13
yields the value t = .79 (since $F_{max} = 8.01/5.66 = 1.42$).

$$t = \frac{(F_{max} - 1)\sqrt{n - 2}}{2\sqrt{F_{max}(1 - r^2_{X_1X_2})}}$$  (Equation 17.13)

Where: $F_{max} = \tilde{s}_L^2 / \tilde{s}_S^2$ (which is Equation 11.6)
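The algebraic equivalence of the variance-difference form (Equations 17.9 and 17.11) and the F-ratio form (Equations 17.12 and 17.13) can be checked numerically. The sketch below uses the rounded summary values from Example 17.1; the function names are illustrative assumptions.

```python
import math


def t_from_variances(var_a, var_b, r, n):
    """Equations 17.9/17.11: (var_a - var_b) * sqrt(n - 2) / sqrt(4 * var_a * var_b * (1 - r^2))."""
    return (var_a - var_b) * math.sqrt(n - 2) / math.sqrt(
        4 * var_a * var_b * (1 - r ** 2)
    )


def t_from_f_ratio(f_ratio, r, n):
    """Equations 17.12/17.13: (F - 1) * sqrt(n - 2) / (2 * sqrt(F * (1 - r^2)))."""
    return (f_ratio - 1) * math.sqrt(n - 2) / (2 * math.sqrt(f_ratio * (1 - r ** 2)))


var_1, var_2 = 8.01, 5.66   # estimated population variances, Example 17.1
r, n = 0.78, 10

t_variance_form = t_from_variances(var_1, var_2, r, n)
t_ratio_form = t_from_f_ratio(var_1 / var_2, r, n)   # F kept unrounded here
t_reversed = t_from_variances(var_2, var_1, r, n)    # Equation 17.11 with labels swapped

print(round(t_variance_form, 2), round(t_ratio_form, 2), round(t_reversed, 2))
```

With the F ratio left unrounded the two forms agree exactly, and swapping the condition labels in Equation 17.11 only changes the sign of t, consistent with the remark that the sign matters only for a directional alternative hypothesis.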

All of the equations noted in this section for the t test for homogeneity of variance for
two dependent samples are based on the following two assumptions: a) The samples have been
randomly drawn from the populations they represent; and b) The distribution of data in the
underlying population each of the samples represents is normal. It is noted in the discussion of


© 2000 by Chapman & Hall/CRC 


homogeneity of variance in Section VI of the t test for two independent samples, that violation
of the normality assumption can severely compromise the reliability of certain tests of
homogeneity of variance. The t test for homogeneity of variance for two dependent samples is
among those tests whose reliability can be compromised if the normality assumption is violated.

The problems associated with the use of the t test for two independent samples when the
homogeneity of variance assumption is violated are also applicable to the t test for two
dependent samples. Thus, if the homogeneity of variance assumption is violated, it will generally
inflate the Type I error rate associated with the t test for two dependent samples. The reader
should take note of the fact that when the homogeneity of variance assumption is violated with
a dependent samples design, its effect on the Type I error rate will be greater than for an
independent samples design. In the event the homogeneity of variance assumption is violated
for a dependent samples design, either of the following strategies can be employed: a) In
conducting the t test for two dependent samples, the researcher can run a more conservative
test. Thus, if the researcher does not want the Type I error rate to be greater than .05, instead of
employing $t_{.05}$ as the tabled critical value, she can employ $t_{.01}$ to represent the latter
value; or b) In lieu of the t test for two dependent samples, a nonparametric test that does not
assume homogeneity of variance can be employed to evaluate the data (such as the Wilcoxon
matched-pairs signed-ranks test).


4. Computation of the power of the t test for two dependent samples and the application
of Test 17b: Cohen's d index
In this section the two methods for computing power that are described for computing the power
of the t test for two independent samples will be extended to the t test for two dependent
samples. Prior to reading this section, the reader may find it useful to review the discussion of
power for both the single-sample t test and the t test for two independent samples.

The first procedure to be described is the graphical method, which reveals the logic
underlying the power computations for the t test for two dependent samples. In the discussion to
follow, it will be assumed that the null hypothesis is identical to that employed for Example 17.1
(i.e., $H_0: \mu_1 - \mu_2 = 0$, which, as previously noted, is another way of writing
$H_0: \mu_1 = \mu_2$). It will also be assumed that the researcher wants to evaluate the power of
the t test for two dependent samples in reference to the following alternative hypothesis:
$H_1: |\mu_1 - \mu_2| \geq 1.6$ (which is the difference obtained between the means of the two
experimental conditions in Example 17.1). In other words, it is predicted that the absolute value
of the difference between the two means is equal to or greater than 1.6. The latter alternative
hypothesis is employed in lieu of $H_1: \mu_1 - \mu_2 \neq 0$ (which can also be written as
$H_1: \mu_1 \neq \mu_2$), since in order to compute the power of the test, a specific value must be
stated for the difference between the population means. Note that, as stated, the alternative
hypothesis stipulates a nondirectional analysis, since it does not specify which of the two means
will be the larger value. It will be assumed that $\alpha = .05$ is employed in the analysis.

Figure 17.1, which provides a visual summary of the power analysis, is composed of two
overlapping sampling distributions of difference scores. The distribution on the left, which will
be designated as Distribution A, is a sampling distribution of difference scores that has a mean
value of zero (i.e., $\mu_{\bar{D}} = \mu_{\bar{X}_1 - \bar{X}_2} = 0$). This latter value will be
represented by $\mu_{D_0} = 0$ in Figure 17.1. Distribution A represents the sampling distribution
that describes the distribution of difference scores if the null hypothesis is true. The distribution
on the right, which will be designated as Distribution B, is a sampling distribution of difference
scores that has a mean value of 1.6 (i.e., $\mu_{\bar{D}} = \mu_{\bar{X}_1 - \bar{X}_2} = 1.6$).
This latter value will be represented by $\mu_{D_1} = 1.6$ in Figure 17.1. Distribution B represents
the sampling distribution that describes the distribution of difference scores if the alternative
hypothesis is true. Each of the sampling distributions has a standard deviation that is equal to
the value computed for $s_{\bar{D}} = .56$, the estimated standard error of the mean difference,
since the latter value provides the best estimate of the standard deviation of the mean difference
in the underlying populations.

In Figure 17.1, area (///) delineates the proportion of Distribution A that corresponds to the
value $\alpha/2$, which equals .025. This is the case, since $\alpha = .05$ and a two-tailed analysis
is conducted. Area (=) delineates the proportion of Distribution B that corresponds to the
probability of committing a Type II error ($\beta$). Area (\\\) delineates the proportion of
Distribution B that represents the power of the test (i.e., $1 - \beta$).

The procedure for computing the proportions documented in Figure 17.1 will now be
described. The first step in computing the power of the test requires one to determine how large
a difference there must be between the sample means in order to reject the null hypothesis. In
order to do this, we algebraically transpose the terms in Equations 17.1/17.6, using
$s_{\bar{D}}$ to summarize the denominator of the equation, and $t_{.05}$ (the tabled critical
two-tailed .05 t value) to represent t. Thus: $\bar{X}_1 - \bar{X}_2 = (t_{.05})(s_{\bar{D}})$. By
substituting the values $t_{.05} = 2.26$ and $s_{\bar{D}} = .56$ in the latter equation, we determine
that the minimum required difference is $\bar{X}_1 - \bar{X}_2 = (2.26)(.56) = 1.27$ (which is
represented by the notation $\bar{X}_D = 1.27$ in Figure 17.1). Thus, any difference between the
two sample means that is equal to or greater than 1.27 will allow the researcher to reject the null
hypothesis at the .05 level.


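[Figure 17.1: two overlapping sampling distributions of difference scores. Distribution A (left)
is centered at $\mu_{D_0} = 0$ and Distribution B (right) at $\mu_{D_1} = 1.6$. The boundary
$\bar{X}_D = 1.27$ corresponds to t = 2.26 on Distribution A and t = -0.59 on Distribution B.
Marked areas: $\alpha/2 = .025$, $\beta = .28$, Power $= 1 - \beta = .72$.]

Figure 17.1 Visual Representation of Power for Example 17.1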


The next step in the analysis requires one to compute the area in Distribution B that falls
between the mean difference $\mu_{D_1} = 1.6$ (i.e., the mean of Distribution B) and a mean
difference equal to 1.27 (represented by the notation $\bar{X}_D = 1.27$ in Figure 17.1). This is
accomplished by employing Equations 17.1/17.6. In using the latter equation, the value of
$\bar{X}_1$ is represented by 1.27 and the value of $\bar{X}_2$ by $\mu_{D_1} = 1.6$.


$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{D}}} = \frac{1.27 - 1.6}{.56} = -.59$$

By interpolating the values listed in Table A2 for df = 9, we determine that the proportion
of Distribution B that falls between the mean and a t score of -.59 (which corresponds to a mean
difference of 1.27) is approximately .22. The latter area plus the 50% of Distribution B to the
right of the mean corresponds to area (\\\) in Distribution B. Note that the left boundary of area
(\\\) is also the boundary delineating the extreme 2.5% of Distribution A (i.e., $\alpha/2 = .025$,
which is the rejection zone for the null hypothesis). Since area (\\\) in Distribution B overlaps the
rejection zone in Distribution A, area (\\\) represents the power of the test (i.e., it represents
the likelihood of rejecting the null hypothesis if the alternative hypothesis is true). The power of
the test is obtained by adding .22 and .5. Thus, the power of the test equals .72. The likelihood
of committing a Type II error ($\beta$) is represented by area (=), which comprises the remainder
of Distribution B. The proportion of Distribution B that constitutes this latter area is determined
by subtracting the value .72 from 1. Thus: $\beta = 1 - .72 = .28$.

Based on the results of the power analysis, we can state that if the alternative hypothesis
$H_1: |\mu_1 - \mu_2| \geq 1.6$ is true, the likelihood that the null hypothesis will be rejected is
.72, and at the same time, there is a .28 likelihood that it will be retained. If the researcher
considers the computed value for power too low (which in actuality should be determined prior to
conducting a study), she can increase the power of the test by employing a larger sample size.
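The graphical method can be reproduced numerically. The sketch below follows the same steps (minimum required difference, its t score on Distribution B, area to the right of that score) and, like the graphical method, treats Distribution B as a central t distribution with df = 9 and ignores the negligible lower rejection tail. The helper functions `t_pdf` and `t_cdf` are assumptions written for this illustration (a Simpson's rule integration of the t density) so that no third-party library is needed.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t), by Simpson's rule on [0, |t|] plus symmetry about zero.

    steps must be even for Simpson's rule.
    """
    a = abs(t)
    if a == 0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area


df = 9
t_critical = 2.26          # tabled two-tailed .05 value for df = 9
se_mean_diff = 0.56        # estimated standard error of the mean difference
true_diff = 1.6            # mean of Distribution B

min_diff = t_critical * se_mean_diff                 # about 1.27
t_on_b = (min_diff - true_diff) / se_mean_diff       # about -0.6
power = 1 - t_cdf(t_on_b, df)                        # area of B to the right of the boundary
beta = 1 - power

print(round(min_diff, 2), round(t_on_b, 2), round(power, 2), round(beta, 2))
```

The result agrees with the tabled interpolation to two decimal places. An exact power computation would use the noncentral t distribution, which this sketch (like the graphical method) does not attempt.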

Method 2 for computing the power of the t test for two dependent samples employing
Test 17b: Cohen's d index
Method 2 described for computing the power of the t test for two independent samples can also
be extended to the t test for two dependent samples. In using this latter method, the researcher
must stipulate an effect size (d), which in the case of the t test for two dependent samples is
computed with Equation 17.14. The effect size index computed with Equation 17.14 was
developed by Cohen (1977, 1988), and is known as Cohen's d index. Further discussion of
Cohen's d index can be found in Section IX (the Appendix) of the Pearson product-moment
correlation coefficient under the discussion of meta-analysis and related topics.


$$d = \frac{|\mu_1 - \mu_2|}{\sigma_D}$$  (Equation 17.14)


The numerator of Equation 17.14 represents the hypothesized difference between the two
population means. As is the case with the graphical method described previously, when a power
analysis is conducted after the mean of each sample has been obtained, the difference between
the two sample means (i.e., $\bar{X}_1 - \bar{X}_2$) is employed as an estimate of the value of
$|\mu_1 - \mu_2|$. In Equation 17.14, the value of $\sigma_D$ represents the standard deviation of
the difference scores in the population. In order to compute the power of the t test for two
dependent samples, the latter value must either be known or be estimated by the researcher. If
power is computed after the sample data have been collected, one can employ the value computed
for $\tilde{s}_D$ to estimate the value of $\sigma_D$. Thus, in the case of Example 17.1 we can
employ $\tilde{s}_D = 1.78$ as an estimate of $\sigma_D$.

It should be noted that if one computes the power of a test prior to collecting the data
(which is what a researcher should ideally do), most researchers will have great difficulty coming
up with a reasonable estimate for the value of $\sigma_D$. Since a researcher is more likely to be
able to estimate the values of $\sigma_1$ and $\sigma_2$ (i.e., the population standard deviation
for each of the experimental conditions), if it can be assumed that $\sigma_1 = \sigma_2$ (which is
true if the population variances are homogeneous), the value of $\sigma_D$ can be estimated with
Equation 17.15.


$$\sigma_D = \sigma\sqrt{2(1 - \rho_{X_1X_2})}$$  (Equation 17.15)

Where: $\rho_{X_1X_2}$ is the correlation between the two variables in the underlying populations
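A small sketch of Equation 17.15 may help when planning a study. The function name `sigma_difference` is an assumption, and the planning values $\sigma = 2.6$ and $\rho = .78$ are hypothetical numbers chosen only for illustration; they are not estimates given in the text.

```python
import math


def sigma_difference(sigma, rho):
    """Equation 17.15: sigma_D = sigma * sqrt(2 * (1 - rho)).

    sigma is the common population standard deviation (sigma_1 = sigma_2 assumed),
    rho the population correlation between the paired scores.
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    return sigma * math.sqrt(2 * (1 - rho))


# Hypothetical planning values (not taken from the text)
sigma_guess, rho_guess = 2.6, 0.78
sigma_d = sigma_difference(sigma_guess, rho_guess)
print(round(sigma_d, 2))

# The higher the correlation, the smaller the standard deviation of the differences.
for rho in (0.0, 0.5, 0.9):
    print(rho, round(sigma_difference(sigma_guess, rho), 2))
```

Note that when $\rho = .5$ the standard deviation of the difference scores equals the common standard deviation, and that it shrinks toward zero as $\rho$ approaches +1.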


Since the effect size computed with Equation 17.14 is only based on population parameters,
it is necessary to convert the value of d into a measure that takes into account the size of the
sample (which is a relevant variable in determining the power of the test). This measure, as
noted in the discussions of the single-sample t test and the t test for two independent samples,
is referred to as the noncentrality parameter. Equation 17.16 is employed to compute the
noncentrality parameter ($\delta$) for the t test for two dependent samples.


$$\delta = d\sqrt{n} \qquad \text{(Equation 17.16)}$$


The power of the t test for two dependent samples will now be computed using the data
for Example 17.1. For purposes of illustration, it will be assumed that the minimum difference
between the population means the researcher is trying to detect is the observed 1.6 point
difference between the two sample means, i.e., $|\bar{X}_1 - \bar{X}_2| = |4.7 - 3.1| = 1.6 = |\mu_1 - \mu_2|$. The
value of $\sigma_D$ that will be employed in Equation 17.14 is $\tilde{s}_D = 1.78$ (which is the estimated value
of the population parameter computed for the sample data). Substituting $|\mu_1 - \mu_2| = 1.6$ and
$\sigma_D = 1.78$ in Equation 17.14, the value d = .90 is computed.


$$d = \frac{1.6}{1.78} = .90$$


Cohen (1977; 1988, pp. 24-27) has proposed the following (admittedly arbitrary) d values
as criteria for identifying the magnitude of an effect size: a) A small effect size is one that is 
greater than .2 but not more than .5 standard deviation units; b) A medium effect size is one that 
is greater than .5 but not more than .8 standard deviation units; and c) A large effect size is 
greater than .8 standard deviation units. Employing Cohen's (1977, 1988) guidelines, the value 
d = .90 (which represents .90 standard deviation units) is categorized as a large effect size. 

Along with the value n = 10, the value d = .90 is substituted in Equation 17.16, resulting
in the value $\delta = 2.85$.


$$\delta = .90\sqrt{10} = 2.85$$


The value $\delta = 2.85$ is evaluated with Table A3 (Power Curves for Student's t
Distribution) in the Appendix. We will assume that for the example under discussion a two-tailed
test is conducted with $\alpha = .05$, and thus Table A3-C is the appropriate set of power curves to
employ for the analysis. Since there is no curve for df = 9, the power of the test will be based on
a curve that falls between the df = 6 and df = 12 power curves. Through interpolation, the power
of the t test for two dependent samples is determined to be approximately .72 (which is the
same value that is obtained with the graphical method). Thus, by employing 10 subjects the
researcher has a probability of .72 of rejecting the null hypothesis if the true difference between
the population means is equal to or greater than .90 $\sigma_D$ units (which in Example 17.1 is
equivalent to a 1.6 point difference between the means).
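
The arithmetic behind Equations 17.14 and 17.16 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, d is rounded to two decimal places before it is used (as is done in the text), and the label returned for values of d of .2 or less is my own, since the criteria quoted above do not name that range. The final step, reading power from the curves in Table A3, is not reproduced here.

```python
import math

def cohens_d(mean_difference, sigma_d):
    """Equation 17.14: effect size in standard deviation units."""
    return abs(mean_difference) / sigma_d

def noncentrality(d, n):
    """Equation 17.16: noncentrality parameter for n pairs of scores."""
    return d * math.sqrt(n)

def cohen_label(d):
    """Cutoffs as quoted in the text from Cohen (1977, 1988)."""
    if d > 0.8:
        return "large"
    if d > 0.5:
        return "medium"
    if d > 0.2:
        return "small"
    return "below small"  # label not given in the text; an assumption

# Example 17.1: means 4.7 and 3.1, estimated sigma_D = 1.78, n = 10
d = round(cohens_d(4.7 - 3.1, 1.78), 2)
delta = round(noncentrality(d, 10), 2)
print(d, cohen_label(d), delta)
```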

As long as a researcher knows or is able to estimate the value of $\sigma_D$, by employing trial
and error she can substitute various values of n in Equation 17.16, until the computed value of
$\delta$ corresponds to the desired power value for the t test for two dependent samples for a given
effect size. This process can be facilitated by employing tables developed by Cohen (1977,
1988) which allow one to determine the minimum sample size necessary in order to achieve a
specific level of power in reference to a given effect size.
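
The trial-and-error search described above can be sketched as a simple loop. The sketch assumes the researcher has already read, from the appropriate power curve, the value of $\delta$ that corresponds to the desired power; the target value 3.5 used below is purely hypothetical, as is the function name `smallest_n`.

```python
import math

def smallest_n(d, target_delta, start=2, limit=100000):
    """Increase n until d * sqrt(n) (Equation 17.16) reaches target_delta."""
    for n in range(start, limit + 1):
        if d * math.sqrt(n) >= target_delta:
            return n
    raise ValueError("target_delta not reached within the search limit")

# d = .90 as in Example 17.1; target_delta = 3.5 is a hypothetical value
print(smallest_n(0.90, 3.5))  # 16, since .90 * sqrt(16) = 3.6
```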

5. Measure of magnitude of treatment effect for the t test for two dependent samples:
Omega squared (Test 17c) Prior to reading this section, the reader should review the
discussion of magnitude of treatment effect and the omega squared statistic in Section VI of
the t test for two independent samples. In the latter discussion, it is noted that the computation
of a t value only provides a researcher with information concerning the likelihood of the null
hypothesis being false, but does not provide information on the magnitude of any treatment effect
that is present. As noted in the discussion of the t test for two independent samples, a treatment
effect is defined as the proportion of the variability on the dependent variable that is associated
with the experimental treatments/conditions. As is the case with the t test for two independent
samples, the magnitude of a treatment effect for the t test for two dependent samples can be
estimated with the omega squared statistic ($\tilde{\omega}^2$).

In order to compute the appropriate omega squared statistic for the t test for two dependent
samples, it is necessary to obtain additional values that have not been computed for
Example 17.1. The latter values are obtained within the framework of conducting a single-factor
within-subjects analysis of variance (Test 24), which is an alternative procedure that can be
employed to evaluate the data for Example 17.1 (yielding equivalent results). The derivation of
the relevant values for computing omega squared (which are summarized in Table 17.4) is
described in the discussion of the single-factor within-subjects analysis of variance.


Table 17.4. Summary Table of Single-Factor Within-Subjects Analysis
of Variance for Example 17.1


Source of variation        SS      df      MS       F
Between-subjects         108.8      9     12.09
Between-conditions        12.8      1     12.80    8.11
Residual                  14.2      9      1.58

Total                    135.8     19
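
The entries in Table 17.4 can be checked with a short Python sketch. It uses the ten pairs of scores listed for Example 17.3, which the text states are the same data as Example 17.1, and the usual sums-of-squares partition for a single-factor within-subjects design. The function and key names are hypothetical, and the sketch is an assumption about how the table was derived rather than a reproduction of the Test 24 equations.

```python
# (Condition 1, Condition 2) scores for the 10 subjects
scores = [(9, 8), (2, 2), (1, 3), (4, 2), (6, 3),
          (4, 0), (7, 4), (8, 5), (5, 4), (1, 0)]

def within_subjects_anova(rows):
    n, k = len(rows), len(rows[0])
    flat = [x for row in rows for x in row]
    correction = sum(flat) ** 2 / (n * k)            # (sum of all scores)^2 / nk
    ss_total = sum(x * x for x in flat) - correction
    ss_subjects = sum(sum(row) ** 2 for row in rows) / k - correction
    ss_conditions = sum(sum(row[j] for row in rows) ** 2
                        for j in range(k)) / n - correction
    ss_residual = ss_total - ss_subjects - ss_conditions
    ms_conditions = ss_conditions / (k - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return {"ss_total": ss_total, "ss_subjects": ss_subjects,
            "ss_conditions": ss_conditions, "ss_residual": ss_residual,
            "F": ms_conditions / ms_residual}

table = within_subjects_anova(scores)
print({name: round(value, 2) for name, value in table.items()})
```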


Keppel (1991) and Kirk (1995) note that there is disagreement with respect to which of the
components derived in the analysis of variance should be employed in computing omega
squared for a within-subjects design. One method of computing omega squared (which
computes a value referred to as standard omega squared) was employed in the previous edition
of this book. The latter method expresses treatment (i.e., between-conditions) variability as a
proportion of the sum of all the elements that account for variability in a within-subjects design.
Another method of computing omega squared is referred to as partial omega squared. The
latter measure, which Keppel (1991) and Kirk (1995) view as more meaningful than standard
omega squared, ignores between-subjects variability, and expresses treatment (i.e.,
between-conditions) variability as a proportion of the sum of between-conditions and residual variability.
For a set of data, the value computed for partial omega squared will always be larger than the
value computed for standard omega squared.

Equation 17.17 is employed to compute partial omega squared ($\tilde{\omega}_p^2$). Since in Equation
17.17 k equals the number of experimental conditions, in the case of the t test for two dependent
samples k will always equal 2.


$$\tilde{\omega}_p^2 = \frac{(k - 1)(F_{BC} - 1)}{(k - 1)(F_{BC} - 1) + nk} \qquad \text{(Equation 17.17)}$$


$$\tilde{\omega}_p^2 = \frac{(2 - 1)(8.11 - 1)}{(2 - 1)(8.11 - 1) + (10)(2)} = \frac{7.11}{27.11} = .262$$


The value $\tilde{\omega}_p^2 = .262$ computed for partial omega squared indicates 26.2% of the
variability on the dependent variable (galvanic skin response) is associated with variability on
the different levels of the independent variable (sexually explicit versus neutral words). In the
previous edition of this book the value computed for omega squared was .08, which, as noted
earlier, represents standard omega squared. The fact that .08 is less than .262 is consistent with
the fact that the value of standard omega squared will always be smaller than the value of
partial omega squared.

It is noted in an earlier discussion of omega squared (in Section VI of the t test for two
independent samples) that Cohen (1977; 1988, pp. 285-288) has suggested the following
(admittedly arbitrary) values, which are employed in psychology and a number of other disciplines,
as guidelines for interpreting $\tilde{\omega}^2$: a) A small effect size is one that is greater than .0099 but not
more than .0588; b) A medium effect size is one that is greater than .0588 but not more than
.1379; and c) A large effect size is greater than .1379. If Cohen's (1977, 1988) guidelines
are employed, the value $\tilde{\omega}_p^2 = .262$ is categorized as a large effect size. If the value .08
computed for standard omega squared is employed, it is categorized as a medium effect size.
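
Equation 17.17 lends itself to a direct arithmetic check. The sketch below evaluates it for the F ratio in Table 17.4 with k = 2 conditions and n = 10 subjects, and applies the Cohen cutoffs quoted above. The function names are hypothetical, and the label for values of .0099 or less is my own assumption, since the text does not name that range.

```python
def partial_omega_squared(f_bc, k, n):
    """Equation 17.17: (k-1)(F-1) / [(k-1)(F-1) + nk]."""
    effect = (k - 1) * (f_bc - 1)
    return effect / (effect + n * k)

def omega_label(value):
    """Cutoffs for omega squared as quoted in the text from Cohen (1977, 1988)."""
    if value > 0.1379:
        return "large"
    if value > 0.0588:
        return "medium"
    if value > 0.0099:
        return "small"
    return "below small"  # label not given in the text; an assumption

omega_p = partial_omega_squared(8.11, 2, 10)
print(round(omega_p, 3), omega_label(omega_p), omega_label(0.08))
```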

A full discussion of the computation of an omega squared value for a within-subjects
design can be found in Section VI of the single-factor within-subjects analysis of variance, as
well as in Keppel (1991) and Kirk (1995). The point-biserial correlation coefficient (Test 28h)
($r_{pb}$), which is another measure of magnitude of treatment effect that can be employed for the t test for two
dependent samples, is discussed in Section IX (the Appendix) of the Pearson product-moment
correlation coefficient under the discussion of bivariate measures of correlation that
are related to the Pearson product-moment correlation coefficient, and in the discussion of
meta-analysis and related topics.


6. Computation of a confidence interval for the t test for two dependent samples Prior to
reading this section the reader should review the discussion of the computation of confidence
intervals in Section VI of the single-sample t test and the t test for two independent samples.
When interval/ratio data are available for two dependent samples, a confidence interval can be
computed that identifies a range of values within which one can be confident to a specified
degree that the true difference between the two population means lies. Equation 17.18 is the
general equation for computing the confidence interval for the difference between two dependent
population means.


$$CI_{(1 - \alpha)} = (\bar{X}_1 - \bar{X}_2) \pm (t_{\alpha/2})(s_{\bar{D}}) \qquad \text{(Equation 17.18)}$$


Where: $t_{\alpha/2}$ represents the tabled critical two-tailed value in the t distribution, for df = n - 1,
below which a proportion (percentage) equal to $[1 - (\alpha/2)]$ of the cases falls. If the
proportion (percentage) of the distribution that falls within the confidence interval is
subtracted from 1 (100%), it will equal the value of $\alpha$.


Employing Equation 17.18, the 95% confidence interval for Example 17.1 is computed below. In
employing Equation 17.18, $(\bar{X}_1 - \bar{X}_2)$ represents the obtained difference between the means of
the two conditions (which is the numerator of the equation used to compute the value of t), $t_{.05}$
represents the tabled critical two-tailed .05 value for df = n - 1, and $s_{\bar{D}}$ represents the
standard error of the mean difference (which is the denominator of the equation used to compute the
value of t).


$$CI_{.95} = (\bar{X}_1 - \bar{X}_2) \pm (t_{.05})(s_{\bar{D}}) = 1.6 \pm (2.26)(.56) = 1.6 \pm 1.27$$

$$.33 \le (\mu_1 - \mu_2) \le 2.87$$


This result indicates that the researcher can be 95% confident (or the probability is .95) that
the true difference between the population means falls within the range .33 and 2.87.
Specifically, it indicates that one can be 95% confident (or the probability is .95) that the mean
of the population Condition 1 represents is greater than the mean of the population that Condition
2 represents by at least .33 points but not by more than 2.87 points.

The 99% confidence interval for Example 17.1 will also be computed to illustrate that the 
range of values that define a 99% confidence interval is always larger than the range which 
defines a 95% confidence interval. 


$$CI_{.99} = (\bar{X}_1 - \bar{X}_2) \pm (t_{.01})(s_{\bar{D}}) = 1.6 \pm (3.25)(.56) = 1.6 \pm 1.82$$

$$-.22 \le (\mu_1 - \mu_2) \le 3.42$$


Thus, the researcher can be 99% confident (or the probability is .99) that the true difference
between the population means falls within the range -.22 and 3.42. Specifically, it indicates that
one can be 99% confident (or the probability is .99) that the mean of the population Condition
2 represents is no more than .22 points higher than the mean of the population that Condition 1
represents, and that the mean of the population that Condition 1 represents is no more than 3.42
points higher than the mean of the population that Condition 2 represents. The reader should take
note of the fact that the reliability of Equation 17.18 will be compromised if one or more of the
assumptions of the t test for two dependent samples are saliently violated.
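
The two intervals can be reproduced with a few lines of Python. In this sketch the mean difference (1.6), the standard error of the mean difference (.56), and the critical t values for df = 9 (2.26 and 3.25) are simply the rounded values quoted in the text; the function name `paired_ci` is hypothetical.

```python
def paired_ci(mean_diff, se_diff, t_crit):
    """Equation 17.18: mean difference plus or minus (critical t)(standard error)."""
    margin = t_crit * se_diff
    return mean_diff - margin, mean_diff + margin

low95, high95 = paired_ci(1.6, 0.56, 2.26)   # 95% interval
low99, high99 = paired_ci(1.6, 0.56, 3.25)   # 99% interval
print(round(low95, 2), round(high95, 2))
print(round(low99, 2), round(high99, 2))
```

Because the 99% interval contains zero while the 95% interval does not, the two intervals agree with the outcome of the two-tailed t test at the .01 and .05 levels, respectively.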


7. Test 17d: Sandler's A test Sandler (1955) derived a computationally simpler procedure,
referred to as Sandler's A test, which is mathematically equivalent to the t test for two dependent
samples. The test statistic for Sandler's A test is computed with Equation 17.19.


$$A = \frac{\Sigma D^2}{(\Sigma D)^2} \qquad \text{(Equation 17.19)}$$





Note that in Equation 17.19, $\Sigma D$ and $\Sigma D^2$ are the same elements computed in Table 17.1
which are employed for the direct difference method for the t test for two dependent samples.
When Equation 17.19 is employed for Example 17.1, the value A = .211 is computed.


$$A = \frac{54}{(16)^2} = .211$$





The reader should take note of the fact that except for when $\Sigma D = 0$, the value of A must
be a positive number. If a negative value is obtained for A, it indicates that a computational error
has been made. If $\Sigma D = 0$ (which indicates that the means of the two conditions are equal),
Equation 17.19 becomes unsolvable.
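
A minimal Python sketch of Equation 17.19 follows. It uses the ten pairs of scores listed for Example 17.3, which the text states are the same data as Example 17.1; the function name `sandler_a` is hypothetical, and raising an error when the sum of the difference scores is zero is simply one way of handling the unsolvable case.

```python
# (Condition 1, Condition 2) scores for the 10 subjects
pairs = [(9, 8), (2, 2), (1, 3), (4, 2), (6, 3),
         (4, 0), (7, 4), (8, 5), (5, 4), (1, 0)]

def sandler_a(differences):
    """Equation 17.19: sum of squared differences over the squared sum of differences."""
    sum_d = sum(differences)
    if sum_d == 0:
        raise ValueError("Sum of D is 0: Equation 17.19 cannot be solved")
    return sum(d * d for d in differences) / sum_d ** 2

differences = [x1 - x2 for x1, x2 in pairs]
a_value = sandler_a(differences)
print(sum(differences), sum(d * d for d in differences), round(a_value, 3))
```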


Table 17.5 Tabled Critical .05 and .01 A values for df = 9


                       $A_{.05}$    $A_{.01}$
Two-tailed values        .276       .185
One-tailed values        .368       .213

The obtained value A = .211 is evaluated with Table A12 (Table of Sandler's A Statistic)
in the Appendix. As is the case for the t test for two dependent samples, the degrees of
freedom employed for Sandler's A test are computed with Equation 17.5. Thus, df = 10 - 1 =
9. The tabled critical two-tailed and one-tailed .05 and .01 values for df = 9 are summarized in
Table 17.5.

The following guidelines are employed in evaluating the null hypothesis for Sandler's A 
test. 

a) If the nondirectional alternative hypothesis $H_1: \mu_1 \neq \mu_2$ is employed, the null hypothesis
can be rejected if the obtained value of A is equal to or less than the tabled critical two-tailed
value at the prespecified level of significance.

b) If the directional alternative hypothesis $H_1: \mu_1 > \mu_2$ is employed, the null hypothesis
can be rejected if the sign of $\Sigma D$ is positive (i.e., $\bar{X}_1 > \bar{X}_2$), and the value of A is equal to or
less than the tabled critical one-tailed value at the prespecified level of significance.

c) If the directional alternative hypothesis $H_1: \mu_1 < \mu_2$ is employed, the null hypothesis
can be rejected if the sign of $\Sigma D$ is negative (i.e., $\bar{X}_1 < \bar{X}_2$), and the value of A is equal to
or less than the tabled critical one-tailed value at the prespecified level of significance.

Employing the above guidelines, the following conclusions can be reached. 

The nondirectional alternative hypothesis $H_1: \mu_1 \neq \mu_2$ is supported at the .05 level,
since the computed value A = .211 is less than the tabled critical two-tailed value $A_{.05} = .276$.
The latter alternative hypothesis, however, is not supported at the .01 level, since A = .211 is
greater than the tabled critical two-tailed value $A_{.01} = .185$.

The directional alternative hypothesis $H_1: \mu_1 > \mu_2$ is supported at both the .05 and .01
levels, since $\Sigma D = 16$ is a positive number, and the obtained value A = .211 is less than the tabled
critical one-tailed values $A_{.05} = .368$ and $A_{.01} = .213$.

The directional alternative hypothesis $H_1: \mu_1 < \mu_2$ is not supported, since $\Sigma D = 16$ is a
positive number. In order for the directional alternative hypothesis $H_1: \mu_1 < \mu_2$ to be
supported, the value of $\Sigma D$ must be a negative number (and the computed value
of A must be equal to or less than the tabled critical one-tailed value at the prespecified level of
significance).
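
The decision guidelines for Sandler's A test can be sketched as a small Python function. Note that, unlike t, the null hypothesis is rejected when A is equal to or less than the critical value. The function name and the strings used to name the alternative hypotheses are hypothetical; the critical values are those listed in Table 17.5 for df = 9.

```python
def sandler_reject(a_value, sum_d, critical_a, alternative="two-sided"):
    """Return True if the null hypothesis can be rejected for the stated alternative."""
    if alternative == "two-sided":      # H1: mu1 != mu2
        direction_ok = True
    elif alternative == "greater":      # H1: mu1 > mu2 requires a positive sum of D
        direction_ok = sum_d > 0
    elif alternative == "less":         # H1: mu1 < mu2 requires a negative sum of D
        direction_ok = sum_d < 0
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    return direction_ok and a_value <= critical_a

# Example 17.1: A = .211, sum of D = 16
print(sandler_reject(0.211, 16, 0.276))             # two-tailed, .05 level
print(sandler_reject(0.211, 16, 0.185))             # two-tailed, .01 level
print(sandler_reject(0.211, 16, 0.213, "greater"))  # one-tailed, .01 level
print(sandler_reject(0.211, 16, 0.368, "less"))     # wrong direction
```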

Note that the results obtained for Sandler's A test are identical to those obtained when the
t test for two dependent samples is employed to evaluate Example 17.1. Equation 17.20
describes the relationship between Sandler's A statistic and the t value computed for the t test for
two dependent samples.





$$A = \frac{n - 1}{nt^2} + \frac{1}{n} \qquad \text{(Equation 17.20)}$$


It is demonstrated below that when t = 2.86 (the value computed with Equations
17.1/17.6) is substituted in Equation 17.20, it yields, within rounding error, the value A = .211
computed with Equation 17.19.


$$A = \frac{10 - 1}{(10)(2.86)^2} + \frac{1}{10} = .210$$
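
The equivalence of the two statistics can be checked numerically without rounding. The sketch below computes A directly from the difference scores of Example 17.1, computes t by the direct difference method (assuming the usual estimate of the standard deviation of the difference scores with n - 1 in the denominator), and then recovers A from t with Equation 17.20. Variable names are hypothetical. The unrounded t comes out slightly below the 2.86 quoted in the text, which is based on rounded intermediate values.

```python
import math

# Difference scores (Condition 1 minus Condition 2) for Example 17.1
differences = [1, 0, -2, 2, 3, 4, 3, 3, 1, 1]
n = len(differences)
sum_d = sum(differences)
sum_d2 = sum(d * d for d in differences)

# Equation 17.19
a_direct = sum_d2 / sum_d ** 2

# Direct difference method: t = mean difference / standard error of the mean difference
s_d = math.sqrt((sum_d2 - sum_d ** 2 / n) / (n - 1))
t = (sum_d / n) / (s_d / math.sqrt(n))

# Equation 17.20
a_from_t = (n - 1) / (n * t ** 2) + 1 / n

print(round(t, 2), round(a_direct, 3), round(a_from_t, 3))
```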


8. Test 17e: The z test for two dependent samples There are occasions (albeit infrequent)
when a researcher wants to compare the means of two dependent samples, and happens to know
the variances of the two underlying populations. In such a case, the z test for two dependent
samples should be employed to evaluate the data instead of the t test for two dependent
samples. As is the case with the latter test, the z test for two dependent samples assumes that
the two samples are randomly selected from populations that have normal distributions. The
effect of violation of the normality assumption on the test statistic decreases as the size of the
sample employed in an experiment increases. The homogeneity of variance assumption noted
for the t test for two dependent samples is not an assumption of the z test for two dependent
samples.

The null and alternative hypotheses employed for the z test for two dependent samples
are identical to those employed for the t test for two dependent samples. Equation 17.21 is
employed to compute the test statistic for the z test for two dependent samples.


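$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma_{\bar{X}_1}^2 + \sigma_{\bar{X}_2}^2 - 2(r_{X_1X_2})(\sigma_{\bar{X}_1})(\sigma_{\bar{X}_2})}} \qquad \text{(Equation 17.21)}$$


Where: $\sigma_{\bar{X}_1}^2 = \sigma_1^2/n$ and $\sigma_{\bar{X}_2}^2 = \sigma_2^2/n$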

The only differences between Equation 17.21 and Equation 17.6 (the equation for the t test
for two dependent samples) are: a) In the denominator of Equation 17.21, in computing the
standard error of the mean for each condition, the population standard deviations $\sigma_1$ and $\sigma_2$ are
employed instead of the estimated population standard deviations $\tilde{s}_1$ and $\tilde{s}_2$ (which are
employed in Equation 17.6); and b) Equation 17.21 computes a z score which is evaluated with the
normal distribution, while Equation 17.6 derives a t score which is evaluated with the t
distribution.

If it is assumed that the two population variances are known for Example 17.1, and that
$\sigma_1^2 = 8.01$ and $\sigma_2^2 = 5.66$, Equation 17.21 can be employed to evaluate the data. Note that
the obtained value z = 2.86 is identical to the value that was computed for t when Equation 17.6
was employed.


$$z = \frac{4.7 - 3.1}{\sqrt{.79 + .56 - 2(.78)(.89)(.75)}} = 2.86$$


The obtained value z = 2.86 is evaluated with Table A1 (Table of the Normal
Distribution) in the Appendix. In Table A1 the tabled critical two-tailed .05 and .01 values are
$z_{.05} = 1.96$ and $z_{.01} = 2.58$, and the tabled critical one-tailed .05 and .01 values are $z_{.05} = 1.65$
and $z_{.01} = 2.33$. Since the computed value z = 2.86 is greater than the tabled critical two-tailed
values $z_{.05} = 1.96$ and $z_{.01} = 2.58$, the nondirectional alternative hypothesis $H_1: \mu_1 \neq \mu_2$
is supported at both the .05 and .01 levels. Since the computed value z = 2.86 is a positive number
which is greater than the tabled critical one-tailed values $z_{.05} = 1.65$ and $z_{.01} = 2.33$, the
directional alternative hypothesis $H_1: \mu_1 > \mu_2$ is also supported at both the .05 and .01 levels.
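
Equation 17.21 can be sketched in Python as follows. The function name is hypothetical; the inputs are the means, the assumed population variances, the correlation of .78 that appears in the worked computation, and n = 10. Because the sketch does not round intermediate values, it yields a z value of about 2.84 rather than the 2.86 reported in the text; the conclusions with respect to the critical values are unchanged.

```python
import math

def z_dependent(mean1, mean2, var1, var2, r, n):
    """Equation 17.21: z test for two dependent samples with known variances."""
    var_mean1 = var1 / n                  # squared standard error, condition 1
    var_mean2 = var2 / n                  # squared standard error, condition 2
    se_diff = math.sqrt(var_mean1 + var_mean2
                        - 2 * r * math.sqrt(var_mean1) * math.sqrt(var_mean2))
    return (mean1 - mean2) / se_diff

z = z_dependent(4.7, 3.1, 8.01, 5.66, 0.78, 10)
print(round(z, 2))
print(z > 1.96, z > 2.58)   # two-tailed .05 and .01 critical values
print(z > 1.65, z > 2.33)   # one-tailed .05 and .01 critical values
```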

When the same set of data is evaluated with the t test for two dependent samples,
although the directional alternative hypothesis $H_1: \mu_1 > \mu_2$ is supported at both the .05 and .01
levels, the nondirectional alternative hypothesis $H_1: \mu_1 \neq \mu_2$ is only supported at the .05 level.
This latter fact illustrates that if the z test for two dependent samples and the t test for two
dependent samples are employed to evaluate the same set of data (unless the value of n is
extremely large), the latter test will provide a more conservative test of the null hypothesis (i.e.,
make it more difficult to reject $H_0$). This is the case, since the tabled critical values listed for the
z test for two dependent samples will always correspond to the tabled critical values listed in
Table A2 for df = $\infty$ (which are the lowest tabled critical values listed for the t distribution).

The final part of the discussion of the z test for two dependent samples will describe a
special case of the test in which it is employed to evaluate the difference between the average
performance of two conditions, when the scores of subjects are based on a binomially distributed
variable. Example 17.2, which is used to illustrate this application of the test, is the dependent
samples analog of Example 11.3 (which illustrates the analysis of a binomially distributed
variable with the z test for two independent samples). The null and alternative hypotheses
evaluated in Example 17.2 are identical to those evaluated in Example 17.1.


Example 17.2 An experiment is conducted in which each of five subjects is tested for extra- 
sensory perception under two experimental conditions. In Condition 1 a subject listens to a 
relaxation training tape, after which the subject is tested while in a relaxed state of mind. In 
Condition 2 each subject is tested while in a normal state of mind. Assume that the order of 
presentation of the two experimental conditions is counterbalanced, although not completely, 
since to do the latter would require that an even number of subjects be employed in the study. 
Thus, three of the five subjects initially serve in Condition 1 followed by Condition 2, while the 
remaining two subjects initially serve in Condition 2 followed by Condition 1. (The concept of 
counterbalancing is discussed in Section VII.) 

In each experimental condition a subject is tested for 200 trials. In each condition the 
researcher employs as stimuli a different list of 200 binary digits (specifically, the values 0 and 
1) which have been randomly generated by a computer. On each trial, an associate of the 
researcher concentrates on a digit in the order it appears on the list for that condition. While 
the associate does this, a subject is required to guess the value of the number that is employed 
as the stimulus for that trial. The number of correct guesses for subjects under the two experi- 
mental conditions follow. (The first score for each subject is the number of correct responses 
in Condition 1, and the second score is the number of correct responses in Condition 2.): 
Subject 1 (105, 90); Subject 2 (120, 104); Subject 3 (130, 107); Subject 4 (115, 100); Subject 
5 (110, 99). Table 17.6 summarizes the data for the experiment. 


Table 17.6 Data for Example 17.2


                 Condition 1             Condition 2
Subject       $X_1$     $X_1^2$       $X_2$     $X_2^2$      $X_1X_2$
   1           105      11025          90       8100        9450
   2           120      14400         104      10816       12480
   3           130      16900         107      11449       13910
   4           115      13225         100      10000       11500
   5           110      12100          99       9801       10890


$\Sigma X_1 = 580$   $\Sigma X_1^2 = 67650$   $\Sigma X_2 = 500$   $\Sigma X_2^2 = 50166$   $\Sigma X_1X_2 = 58230$


$$\bar{X}_1 = \frac{580}{5} = 116 \qquad \bar{X}_2 = \frac{500}{5} = 100$$

Note that in Example 17.2, the five scores in each of the two experimental conditions are
identical to the five scores employed in the two experimental conditions in Example 11.3. The
only difference between the two examples is the order in which the scores are listed. Specifically,
in Example 17.2 the scores have been arranged so that the two scores in each row (i.e., the two
scores of each subject) have a high positive correlation with one another. Through use of
Equation 17.7, it is demonstrated that the correlation between subjects' scores in the two experimental
conditions is $r_{X_1X_2} = .93$.
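
The correlation can be verified from the raw scores with a short Python sketch. Equation 17.7 is not reproduced in this section, so the sketch assumes it is the usual computational formula for the Pearson product-moment correlation coefficient; the function and variable names are hypothetical.

```python
import math

cond1 = [105, 120, 130, 115, 110]   # correct guesses, Condition 1
cond2 = [90, 104, 107, 100, 99]     # correct guesses, Condition 2

def pearson_r(x, y):
    """Pearson r from sums, sums of squares, and the sum of cross products."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sp = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(a * a for a in x) - sum_x ** 2 / n
    ss_y = sum(b * b for b in y) - sum_y ** 2 / n
    return sp / math.sqrt(ss_x * ss_y)

print(sum(a * b for a, b in zip(cond1, cond2)))   # 58230, as in Table 17.6
print(round(pearson_r(cond1, cond2), 2))          # 0.93
```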

Example 17.2 will be evaluated with Equation 17.22, which is the form Equation 17.21
assumes when $\sigma_1^2 = \sigma_2^2$.


$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma^2}{m} + \frac{\sigma^2}{m} - 2(r_{X_1X_2})\left(\frac{\sigma^2}{m}\right)}} \qquad \text{(Equation 17.22)}$$


Note that in Equation 17.22, m is employed to represent the number of subjects, since the
notation n is employed with the binomial variable to designate the number of trials in which each
subject is tested. Since scores on the binary guessing task described in Example 17.2 are
assumed to be binomially distributed, as is the case in Example 11.3, the following is true: n =
200, $\pi_1 = .5$, and $\pi_2 = .5$. The computed value for the population standard deviation for the
binomially distributed variable is $\sigma = \sqrt{n\pi_1\pi_2} = \sqrt{(200)(.5)(.5)} = 7.07$. (The computation of
the latter values is discussed in Section I of the binomial sign test for a single sample (Test 9).)
When the appropriate values are substituted in Equation 17.22, the value z = 13.52 is computed.


$$z = \frac{116 - 100}{\sqrt{\frac{(7.07)^2}{5} + \frac{(7.07)^2}{5} - 2(.93)\left(\frac{(7.07)^2}{5}\right)}} = 13.52$$





Since the computed value z = 13.52 is greater than the tabled critical two-tailed values
$z_{.05} = 1.96$ and $z_{.01} = 2.58$, the nondirectional alternative hypothesis $H_1: \mu_1 \neq \mu_2$ is
supported at both the .05 and .01 levels. Since the computed value z = 13.52 is a positive number
that is greater than the tabled critical one-tailed values $z_{.05} = 1.65$ and $z_{.01} = 2.33$, the
directional alternative hypothesis $H_1: \mu_1 > \mu_2$ is supported at both the .05 and .01 levels. Thus,
it can be concluded that the average score in Condition 1 is significantly larger than the average
score in Condition 2. Note that when Equation 11.17 is employed with the same set of data, it
yields the value z = 3.58. The fact that the value z = 13.52 obtained with Equation 17.22 is
larger than the value z = 3.58 obtained with Equation 11.17 illustrates that if there is a positive
correlation between the scores of subjects employed in a dependent samples design, a z test for
two dependent samples will provide a more powerful test of an alternative hypothesis than will
a z test for two independent samples (due to the lower value of the denominator for the former
test).
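
The computation for Example 17.2 can be sketched in Python as shown below. The function name is hypothetical, and the sketch works with the unrounded variance $n\pi_1\pi_2 = 50$ rather than the rounded standard deviation 7.07, which does not change the result to two decimal places.

```python
import math

def z_dependent_binomial(mean1, mean2, n_trials, pi1, pi2, r, m):
    """Equation 17.22 for a binomially distributed variable (equal variances)."""
    sigma_sq = n_trials * pi1 * pi2      # population variance of the number correct
    var_mean = sigma_sq / m              # squared standard error for m subjects
    se_diff = math.sqrt(var_mean + var_mean - 2 * r * var_mean)
    return (mean1 - mean2) / se_diff

z = z_dependent_binomial(116, 100, 200, 0.5, 0.5, 0.93, 5)
print(round(math.sqrt(200 * 0.5 * 0.5), 2))   # 7.07
print(round(z, 2))                            # 13.52
```

Setting r to 0 in the same function removes the correlation term from the denominator, which illustrates why a positive correlation between the two sets of scores increases the value of z.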


VII. Additional Discussion of the t Test for Two Dependent Samples 


1. The use of matched subjects in a dependent samples design It is noted in Section I that
the t test for two dependent samples can be applied to a design involving matched subjects.
Matching subjects requires that a researcher initially identify one or more variables (besides the
independent variable) which she believes are positively correlated with the dependent variable
employed in a study. Such a variable can be employed as a matching variable. Each subject who
is assigned to one of the k experimental conditions is matched with one subject in each of the
other (k - 1) experimental conditions. In matching subjects it is essential that any cohort of
subjects who are matched with one another are equivalent (or reasonably comparable) with respect
to any matching variables employed in the study. In a design employing matched subjects there
will be n cohorts (also referred to as blocks) of matched subjects, and within each cohort there
will be k subjects. Each of the k subjects should be randomly assigned to one of the k
experimental conditions/levels of the independent variable. Thus, when k = 2, each of the two
subjects within the n pairs/cohorts will be randomly assigned to one of the two experimental
conditions.
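
The random assignment step within each cohort can be sketched in Python as follows. The subject identifiers, the condition labels, and the function name are all hypothetical; the seed is supplied only so that the illustration is reproducible.

```python
import random

def assign_within_cohorts(cohorts, conditions, seed=None):
    """Randomly assign the k members of each cohort to the k conditions,
    so that every condition is used exactly once within each cohort."""
    rng = random.Random(seed)
    assignments = []
    for cohort in cohorts:
        if len(cohort) != len(conditions):
            raise ValueError("each cohort must contain one subject per condition")
        shuffled = list(conditions)
        rng.shuffle(shuffled)
        assignments.append(dict(zip(cohort, shuffled)))
    return assignments

# Hypothetical matched pairs (k = 2)
cohorts = [("pair1_a", "pair1_b"), ("pair2_a", "pair2_b"), ("pair3_a", "pair3_b")]
conditions = ["Condition 1", "Condition 2"]
for assignment in assign_within_cohorts(cohorts, conditions, seed=1):
    print(assignment)
```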

By matching subjects a researcher is able to conduct a more powerful statistical analysis
than will be the case if subjects in the two conditions are not matched with one another (i.e., if
an independent samples design is employed). The more similar the cohorts of subjects are on the
matched variable(s), the greater the power of the statistical test. In actuality, the most extreme
case of matching is when each subject is matched with him or herself, and thus serves in each of
the k experimental conditions. Within this framework, the design employed for Example 17.1
can be viewed as a matched-subjects design. However, as the term matching is most commonly
employed, within each row of the data summary table (i.e., Table 17.1) different subjects serve
in each of the k experimental conditions. When k = 2, the most extreme case of matching
involving different subjects in each condition is when n pairs of identical twins are employed as
subjects. By virtue of their common genetic makeup, identical twins allow an experimenter to
match a subject with his or her “clone.” In Example 17.3 identical twins are employed as
subjects in the same experiment described by Example 17.1. Analysis of Example 17.3 with the t
test for two dependent samples yields the same result as that obtained for Example 17.1, since
both examples employ the same set of data.


Example 17.3 A psychologist conducts a study to determine whether or not people exhibit 
more emotionality when they are exposed to sexually explicit words than when they are exposed 
to neutral words. Ten sets of identical twins are employed as subjects. Within each twin pair, 
one of the twins is randomly assigned to Condition 1, in which the subject is shown a list of eight 
sexually explicit words, while the other twin is assigned to Condition 2, in which the subject is 
shown a list of eight neutral words. As each word is projected on the screen, a subject is 
instructed to say the word softly to him or herself. As a subject does this, sensors attached to the 
palms of the subject's hands record galvanic skin response (GSR), which is used by the 
psychologist as a measure of emotionality. The psychologist computes two scores for each pair 
of twins to represent the emotionality score for each of the experimental conditions: Condition 
1: GSR/Explicit — The average GSR score for the twin presented with the eight sexually explicit 
words; Condition 2: GSR/Neutral — The average GSR score for the twin presented with the 
eight neutral words. The GSR/Explicit and the GSR/Neutral scores of the ten pairs of twins 
follow. (The first score for each twin pair represents the score of the twin presented with the 
sexually explicit words, and the second score represents the score of the twin presented with the 
neutral words. The higher the score, the higher the level of emotionality.) Twin pair 1 (9, 8); 
Twin pair 2 (2, 2); Twin pair 3 (1, 3); Twin pair 4 (4, 2); Twin pair 5 (6, 3); Twin pair 6 (4, 
0); Twin pair 7 (7, 4); Twin pair 8 (8, 5); Twin pair 9 (5, 4); Twin pair 10 (1, 0). Do subjects
exhibit differences in emotionality with respect to the two categories of words? 


In the event k = 3 and a researcher wants to use identical siblings, identical triplets can be
employed in a study. If k = 4, identical quadruplets can be used, and so on. If the critical
variable(s) the researcher wants to match subjects with respect to are believed to be influenced
by environmental factors, the suitability of employing identical siblings as matched subjects will
be compromised to the degree that within each set of siblings the members of the set do not share
common environmental experiences. Realistically, the number of available identical siblings in
a human population will be quite limited. Thus, with the exception of identical twins, it would
be quite unusual to encounter a study that employs identical siblings. Because of the low
frequency of identical siblings in the general population, in matching subjects a researcher may elect
to employ biological relatives who share less in common with one another or employ people who
are not blood relatives. Example 17.4 illustrates the latter type of matching. Analysis of
Example 17.4 with the t test for two dependent samples yields the same result as that obtained
for Examples 17.1 and 17.3, since all three experiments employ the same set of data.


Example 17.4 A psychologist conducts a study to determine whether or not people exhibit 
more emotionality when they are exposed to sexually explicit words than when they are exposed 
to neutral words. Based on previous research, the psychologist has reason to believe that the 
following three variables are highly correlated with the dependent variable of emotionality: a) 
gender; b) autonomic hyperactivity (which is measured by a series of physiological measures); 
and c) repression/sensitization (which is measured by a pencil and paper personality test). Ten 
pairs of matched subjects who are identical (or very similar) on the three aforementioned 
variables are employed in the study. Within each pair, one person is randomly assigned to a 
condition in which the subject is shown a list of eight sexually explicit words, while the other 
person is assigned to a condition in which the subject is shown a list of eight neutral words. As 
each word is projected on the screen, a subject is instructed to say the word softly to him or 
herself. As a subject does this, sensors attached to the palms of the subject's hands record 
galvanic skin response (GSR), which is used by the psychologist as a measure of emotionality. 
The psychologist computes two scores for each pair of matched subjects to represent the 
emotionality score for each of the experimental conditions: Condition 1: GSR/Explicit — The 
average GSR score for the subject presented with the eight sexually explicit words; Condition 
2: GSR/Neutral — The average GSR score for the subject presented with the eight neutral 
words. The GSR/Explicit and the GSR/Neutral scores of the ten pairs of subjects follow. (The 
first score for each pair represents the score of the person presented with the sexually explicit 
words, and the second score represents the score of the person presented with the neutral words. 
The higher the score, the higher the level of emotionality.) Pair 1 (9, 8); Pair 2 (2, 2); Pair 3 
(1, 3); Pair 4 (4, 2); Pair 5 (6, 3); Pair 6 (4, 0); Pair 7 (7, 4); Pair 8 (8, 5); Pair 9 (5, 4); Pair 
10 (1, 0). Do subjects exhibit differences in emotionality with respect to the two categories of 
words? 
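
The analysis of Example 17.4 can be sketched in a few lines of Python. The sketch below is a minimal illustration, assuming the direct-difference form of the test (the mean difference score divided by the estimated standard error of the mean difference); the function name `paired_t` is hypothetical and is not taken from the Handbook.

```python
from math import sqrt

# (GSR/Explicit, GSR/Neutral) scores for the ten matched pairs in Example 17.4
pairs = [(9, 8), (2, 2), (1, 3), (4, 2), (6, 3),
         (4, 0), (7, 4), (8, 5), (5, 4), (1, 0)]

def paired_t(pairs):
    """Return (t, df) for the t test for two dependent samples."""
    d = [x1 - x2 for x1, x2 in pairs]      # difference score for each pair
    n = len(d)
    mean_d = sum(d) / n                     # mean difference score
    # Unbiased estimate of the population standard deviation of D
    sd_d = sqrt((sum(v * v for v in d) - sum(d) ** 2 / n) / (n - 1))
    se_mean_d = sd_d / sqrt(n)              # standard error of the mean difference
    return mean_d / se_mean_d, n - 1

t, df = paired_t(pairs)
print(f"t = {t:.2f}, df = {df}")
```

Because the same ten pairs of scores appear in the other examples that share this data set, the function returns the same value of t for each of them. A value computed by hand with intermediate rounding may differ slightly in the second decimal place.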


One reason a researcher may elect to employ matched subjects (as opposed to employing 
each subject in all k experimental conditions) is because in many experiments it is not feasible 
to have a subject serve in more than one condition. Specifically, a subject's performance in one 
or more of the conditions might be influenced by his or her experience in one or more of the 
conditions that precede it. In some instances counterbalancing can be employed to control for 
such effects, but in other cases even counterbalancing does not provide the necessary control. 

In spite of the fact that it can increase the power of a statistical analysis, matching is not 
commonly employed in experiments involving human subjects. The reason for this is that match- 
ing requires a great deal of time and effort on the part of a researcher. Not only is it necessary 
to identify one or more matching variables that are correlated with the dependent variable, but 
it is also necessary to identify and obtain the cooperation of a sufficient number of matched 
subjects to participate in an experiment. The latter does not present as much of a problem in 
animal research, where litter mates can be employed as matched cohorts. Example 17.5, which 
is evaluated in the next section, illustrates a design that employs animal litter mates as subjects. 


Example 17.5 A researcher wants to assess the relative effect of two different kinds of punish- 
ment (loud noise versus a blast of cold air) on the emotionality of mice. Five pairs of mice 
derived from five separate litters are employed as subjects. Within each pair, one of the litter 
mates is randomly assigned to one of two experimental conditions. During the course of the 
experiment each mouse is sequestered in an experimental chamber. While in the chamber, each 
of the five mice in Condition 1 is periodically presented with a loud noise, and each of the five 
mice in Condition 2 is periodically presented with a blast of cold air. The presentation of the 
punitive stimulus for each of the animals is generated by a machine that randomly presents the 
stimulus throughout the duration of the time an animal is in the chamber. The dependent vari- 
able of emotionality employed in the study is the number of times each mouse defecates while in 
the experimental chamber. The number of episodes of defecation for the five pairs of mice 
follows. (The first score represents the litter mate exposed to noise and the second score 
represents the litter mate exposed to cold air.) Litter 1 (11, 11); Litter 2 (1, 11); Litter 3 (0, 
5); Litter 4 (2, 8); Litter 5 (0, 4). Do subjects exhibit differences in emotionality under the 
different experimental conditions? 


2. Relative power of the t test for two dependent samples and the t test for two independent 
samples Example 17.5 will be employed to illustrate that the t test for two dependent samples 
provides a more powerful test of an alternative hypothesis than does the t test for two 
independent samples. Except for the fact that it employs a dependent samples design involving 
matched subjects, Example 17.5 is identical to Example 11.4 (which employs an independent 
samples design). Both examples employ the same set of data and evaluate the same null and 
alternative hypotheses. The summary values for evaluating Example 17.5 with the t test for two 
dependent samples (using either Equation 17.1 or 17.6) are noted below. Some of the values 
listed can also be found in Table 11.1 (which summarizes the same set of data for analysis with 
the t test for two independent samples). 


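$\Sigma X_1 = 14$   $\Sigma X_1^2 = 126$   $\Sigma X_2 = 39$   $\Sigma X_2^2 = 347$   $\bar{X}_1 = 2.8$   $\bar{X}_2 = 7.8$


$\Sigma X_1 X_2 = 148$   $\tilde{s}_1^2 = 21.7$   $\tilde{s}_2^2 = 10.7$   $r_{X_1 X_2} = .64$


$\Sigma D = -25$   $\Sigma D^2 = 177$   $\bar{D} = -5$   $\tilde{s}_D = 3.61$   $s_{\bar{D}} = 1.61$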


Since n = 5 in Example 17.5, df = 5 - 1 = 4. In Table A2, for df = 4, the tabled critical 
two-tailed .05 and .01 values are $t_{.05} = 2.78$ and $t_{.01} = 4.60$, and the tabled critical one-tailed 
.05 and .01 values are $t_{.05} = 2.13$ and $t_{.01} = 3.75$. 

The nondirectional alternative hypothesis $H_1: \mu_1 \neq \mu_2$ is supported at the .05 level, since 
the computed absolute value t = 3.10 is greater than the tabled critical two-tailed value 
$t_{.05} = 2.78$. It is not, however, supported at the .01 level, since the absolute value t = 3.10 is 
less than the tabled critical two-tailed value $t_{.01} = 4.60$. 

The directional alternative hypothesis $H_1: \mu_1 < \mu_2$ is supported at the .05 level, since the 
computed value t = -3.10 is a negative number, and the absolute value t = 3.10 is greater than 
the tabled critical one-tailed value $t_{.05} = 2.13$. It is not, however, supported at the .01 level, 
since the absolute value t = 3.10 is less than the tabled critical one-tailed value $t_{.01} = 3.75$. 

The directional alternative hypothesis $H_1: \mu_1 > \mu_2$ is not supported, since the computed 
value t = -3.10 is a negative number. 
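
The decision rules applied above can be summarized in a short Python sketch. It is offered only as an illustration: the function name `supported` and the dictionary keys are hypothetical, and the rule "equal to or greater than the tabled critical value" is assumed for the comparison.

```python
def supported(t, crit_two_tailed, crit_one_tailed):
    """Report which alternative hypotheses a computed t supports at one alpha level."""
    return {
        # Nondirectional: only the absolute value of t matters
        "mu1 != mu2": abs(t) >= crit_two_tailed,
        # Directional: the sign of t must agree with the predicted direction
        "mu1 < mu2": t < 0 and abs(t) >= crit_one_tailed,
        "mu1 > mu2": t > 0 and abs(t) >= crit_one_tailed,
    }

t = -3.10  # computed value for Example 17.5 (df = 4)
at_05 = supported(t, crit_two_tailed=2.78, crit_one_tailed=2.13)
at_01 = supported(t, crit_two_tailed=4.60, crit_one_tailed=3.75)
print(at_05)
print(at_01)
```

The critical values passed to the function are the tabled values for df = 4 quoted in the text; for any other value of df they would have to be looked up again.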

Note that the absolute value t = 3.10 computed for Example 17.5 is substantially 
higher than the absolute value t = 1.96 computed for Example 11.4 (which has the same data as 
Example 11.1) with the t test for two independent samples. In the case of Example 11.4 (as 
well as Example 17.5), the directional alternative hypothesis $H_1: \mu_1 < \mu_2$ is supported at the 
.05 level. However, the nondirectional alternative hypothesis $H_1: \mu_1 \neq \mu_2$ (which is supported 
at the .05 level in the case of Example 17.5) is not supported in Example 11.4 when the data are 
evaluated with the t test for two independent samples. The difference in the conclusions 
reached with the two tests reflects the fact that the t test for two dependent samples provides 
a more powerful test of an alternative hypothesis (assuming there is a positive correlation, which 
in the example under discussion is $r_{X_1 X_2} = .64$, between the scores of subjects in the two 
experimental conditions). In closing this discussion, it is worth noting that designs involving 
independent samples are more commonly employed in research than designs involving dependent 
samples. The reason for this is that, although a dependent samples design 
allows for a more powerful test of an alternative hypothesis, it presents more practical problems 
in its implementation (e.g., controlling for problems that might result from subjects serving in 
multiple conditions; the difficulty of identifying matching variables; identifying and obtaining 
the cooperation of an adequate number of matched subjects). 
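
The source of the difference in power can be seen by computing both standard errors from the data of Example 17.5. The sketch below is a minimal illustration, assuming the standard identity that the variance of a mean difference equals the sum of the two variances of the means minus twice the covariance of the means; the variable names are hypothetical.

```python
from math import sqrt

noise = [11, 1, 0, 2, 0]       # Condition 1 scores, Example 17.5
cold_air = [11, 11, 5, 8, 4]   # Condition 2 scores, Example 17.5
n = len(noise)

def mean(xs):
    return sum(xs) / len(xs)

def unbiased_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

var1, var2 = unbiased_var(noise), unbiased_var(cold_air)
# Covariance of the paired scores (same n - 1 divisor as the variances)
cov = sum((a - mean(noise)) * (b - mean(cold_air))
          for a, b in zip(noise, cold_air)) / (n - 1)
r = cov / sqrt(var1 * var2)

mean_diff = mean(noise) - mean(cold_air)
se_independent = sqrt(var1 / n + var2 / n)               # ignores the pairing
se_dependent = sqrt(var1 / n + var2 / n - 2 * cov / n)   # credits the positive correlation

t_independent = mean_diff / se_independent
t_dependent = mean_diff / se_dependent
print(f"r = {r:.2f}, t (independent) = {t_independent:.2f}, t (dependent) = {t_dependent:.2f}")
```

The numerator is the same in both tests. Only the denominator changes, and it shrinks as the correlation between the paired scores becomes more positive, which is why the dependent samples analysis yields the larger absolute value of t here.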


3. Counterbalancing and order effects When each of the n subjects in an experiment serves 
in all k experimental conditions, it is often necessary to control for the order of presentation of 
the conditions. Thus, if all n subjects are administered Condition 1 first followed by Condition 
2 (or vice versa), factors such as practice or fatigue, which are a direct function of the order of 
presentation of the conditions, can differentially affect subjects’ scores on the dependent variable. 
Specifically, subjects may perform better in Condition 2 due to practice effects or subjects may 
perform worse in Condition 2 as a result of fatigue. As noted earlier, counterbalancing is a 
procedure which allows a researcher to control for such order effects. In complete counter- 
balancing all possible orders for presenting the experimental conditions are represented an equal 
number of times with respect to the total number of subjects employed in a study. Thus, if a 
study with n = 10 subjects and k = 2 conditions is completely counterbalanced, five subjects will 
initially serve in Condition 1 followed by Condition 2, while the other five subjects will initially 
serve in Condition 2 followed by Condition 1. If the number of experimental conditions is 
k = 3, there will be k! = 3! = 6 possible presentation orders (i.e., 1,2,3; 1,3,2; 2,1,3; 2,3,1; 3,1,2; 
3,2,1). Under such conditions a minimum of six subjects will be required in order to employ 
complete counterbalancing. If a researcher wants to assign two subjects to each of the pre- 
sentation orders, 6 x 2 = 12 subjects must be employed. It should be obvious that to completely 
counterbalance the order of presentation of the experimental conditions, the number of subjects 
must equal the value of k! or be some value that is evenly divisible by it. 
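
Complete counterbalancing as described above can be sketched as follows. The function name `complete_counterbalance` is hypothetical, and the sketch assigns subjects to orders by simple cycling; in an actual study the assignment of subjects to presentation orders would ordinarily be random.

```python
from collections import Counter
from itertools import permutations

def complete_counterbalance(n_subjects, k):
    """Give each subject one presentation order so that every order is used equally often."""
    orders = list(permutations(range(1, k + 1)))   # all k! presentation orders
    if n_subjects % len(orders) != 0:
        raise ValueError(
            f"{n_subjects} subjects cannot be split evenly over {len(orders)} orders")
    # Cycle through the orders so each one receives n_subjects / k! subjects
    return [orders[i % len(orders)] for i in range(n_subjects)]

two_conditions = complete_counterbalance(10, 2)     # five subjects per order
three_conditions = complete_counterbalance(12, 3)   # two subjects per order
print(Counter(two_conditions))
print(Counter(three_conditions))
```

Requesting a number of subjects that is not evenly divisible by k! (for instance, 10 subjects with k = 3) raises an error, which mirrors the divisibility requirement stated in the text.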

As the number of experimental conditions increases, complete counterbalancing becomes 
more difficult to implement, since the number of subjects required increases substantially. 
Specifically, if there are k = 5 experimental conditions, there are 5! = 120 presentation orders — 
thus requiring a minimum of 120 subjects (which can be a prohibitively large number for a 
researcher to use) in order that one subject serves in each of the possible presentation orders. 
When it is not possible to completely counterbalance the order of presentation of the conditions, 
alternative less complete counterbalancing procedures are available. The Latin square design 
(which is discussed in Section VII of the single-factor within-subjects analysis of variance) 
can be employed to provide incomplete counterbalancing (i.e., the latter design uses some but 
not all of the possible orders for presenting the experimental conditions). The Latin Square 
design is more likely to be considered as a reasonable option for controlling for order effects 
when the independent variable is comprised of many levels (and consequently it becomes 
prohibitive to employ complete counterbalancing). 
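
One simple way to generate an incomplete set of presentation orders is a cyclic Latin square. The sketch below is an illustration only: a cyclic square is just one of many possible Latin squares, it is not necessarily the arrangement presented in the Handbook's discussion of the Latin square design, and the function name is hypothetical.

```python
def cyclic_latin_square(k):
    """Row i is the order 1..k rotated left by i positions."""
    return [[(row + col) % k + 1 for col in range(k)] for row in range(k)]

square = cyclic_latin_square(5)   # 5 orders instead of 5! = 120
for order in square:
    print(order)
```

Each condition appears exactly once in every row (presentation order) and exactly once in every column (ordinal position), so only k orders are needed. It should be noted that in a cyclic square each condition is always preceded by the same condition, so this arrangement does not balance sequence effects.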


4. Analysis of a before-after design with the t test for two dependent samples In a 
before-after design n subjects are administered a pretest on a dependent variable. After the pretest, all 
n subjects are exposed to the experimental treatment. Subjects are then administered a posttest 
on the same dependent variable. The t test for two dependent samples can be employed to 
determine if there is a significant difference between the pretest versus posttest scores of subjects. 
Although there are published studies that employ the t test for two dependent samples to 
evaluate the aforementioned design, it is important to note that since it lacks a control group, a 
before-after design does not allow a researcher to conclude that the experimental treatment is 
responsible for a significant difference. If a significant result is obtained in such a study, it only 
allows the researcher to conclude that there is a significant statistical association/correlation 
between the experimental treatment and the dependent variable. Since correlational information 
does not allow one to draw conclusions with regard to cause and effect, a researcher cannot con- 
clude that the treatment is directly responsible for the observed difference. Although it is 
possible that the treatment is responsible for the difference, it is also possible that the difference 
is due to one or more other variables that intervened between the pretest and the posttest. 

To modify a before-after design to insure adequate experimental control, it is required that 
two groups of subjects be employed. In such a modification, pretest and posttest scores are ob- 
tained for both groups, but only one of the groups (the experimental group) is exposed to the 
experimental treatment in the time period that intervenes between the pretest and the posttest. 
By virtue of employing a control group which is not exposed to the experimental treatment, the 
researcher is able to rule out the potential influence of confounding variables. Thus, order 
effects, as well as other factors in the environment that subjects may have been exposed to 
between the pretest and the posttest, can be ruled out through use of a control group. Example 
17.6 illustrates a study that employs a before-after design without the addition of the necessary 
control group. 


Example 17.6 In order to assess the efficacy of electroconvulsive therapy (ECT), a psychiatrist 
evaluates ten clinically depressed patients before and after a series of ECT treatments. A 
standardized interview is used to operationalize a patient's level of depression, and on the basis 
of the interview each patient is assigned a score ranging from 0 to 10 with respect to his or her 
level of depression prior to (pretest score) and after (posttest score) the administration of ECT. 
The higher a patient's score, the more depressed the patient. The pretest and posttest scores of 
the ten patients follow: Patient 1 (9, 8); Patient 2 (2, 2); Patient 3 (1, 3); Patient 4 (4, 2); 
Patient 5 (6, 3); Patient 6 (4, 0); Patient 7 (7, 4); Patient 8 (8, 5); Patient 9 (5, 4); Patient 10 
(1, 0). Do the data indicate that ECT is effective? 


Since the data for Example 17.6 are identical to those employed in Example 17.1, they yield the 
same result. Thus, analysis of the data with the t test for two dependent samples indicates that 
there is a significant decrease in depression following the ECT. However, as previously noted, 
because there is no control group, the psychiatrist cannot conclude that ECT is responsible for the 
decrease in depression. Inclusion of a “sham” ECT group (which is analogous to a placebo group) 
can provide the necessary control to evaluate the impact of ECT. Such a group would be 
comprised of ten additional patients for whom pretest and posttest depression scores are obtained. 
Between the pretest and posttest, the patients in the control group undergo all of the preparations 
involved in ECT, but are only administered a simulated ECT treatment (i.e., they are not actually 
administered the shock treatment). Only by including such a control group, can one rule out the 
potential role of extraneous variables that might also be responsible for the lower level of de- 
pression during the posttest. An example of such an extraneous variable would be if all of the 
subjects who receive ECT are in psychotherapy throughout the duration of the experiment. 
Without a control group, a researcher cannot determine whether a lower posttest depression score 
is the result of the ECT, the psychotherapy, the ECT and psychotherapy interacting with one 
another, or some other variable of which the researcher is unaware. By including a control 
group, it is assumed (although not insured) that if any extraneous variables are present, by virtue 
of randomly assigning subjects to groups, the groups will be equated on such variables. 

When the before-after design is modified by the addition of the appropriate control group, 
the resulting design is referred to as a pretest-posttest control group design. Unfortunately, 
researchers are not in agreement with respect to what statistical analysis is most appropriate for 
the latter design. Among the analytical procedures that have been recommended are the 
following: a) The difference scores of the two groups can be contrasted with a t test for two 
independent samples; b) The results can be evaluated by employing a factorial analysis of 
variance for a mixed design (Test 27i). A factorial design has two or more independent vari- 
ables (which are also referred to as factors). Thus, if the appropriate control group is employed 
in Example 17.6, the resulting pretest-posttest control group design can be conceptualized as 
being comprised of two independent variables. One of the independent variables is represented 
by the ECT versus sham ECT manipulation. The second independent variable is the pretest- 
posttest dichotomy. In a mixed factorial design involving two factors, one of the independent 
variables is a between-subjects variable (i.e., different subjects serve under different levels of that 
independent variable). Thus, in the example under discussion, the ECT versus sham ECT 
manipulation represents a between-subjects independent variable. The other independent 
variable in a mixed factorial design is a within-subjects variable (i.e., each subject serves under 
all levels of that independent variable). In the example under discussion, the pretest-posttest 
dichotomy represents a within-subjects independent variable; and c) The single-factor between- 
subjects analysis of covariance (Test 21j) (a procedure that is discussed in Section IX (the 
Addendum) of the single-factor between-subjects analysis of variance (Test 21)) can also be 
employed to evaluate a pretest-posttest control group design. In conducting an analysis of 
covariance, the pretest scores of subjects are employed as the covariate. 
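
The first of the recommended procedures (contrasting the difference scores of the two groups with a t test for two independent samples) can be sketched as follows. The ECT scores are those of Example 17.6. The sham ECT scores are entirely hypothetical and are included only so that the sketch can be run; no conclusion about ECT should be drawn from them. The function names are likewise hypothetical.

```python
from math import sqrt

# (pretest, posttest) scores; the ECT group is taken from Example 17.6
ect = [(9, 8), (2, 2), (1, 3), (4, 2), (6, 3),
       (4, 0), (7, 4), (8, 5), (5, 4), (1, 0)]
# HYPOTHETICAL sham ECT group, invented purely for illustration
sham = [(8, 8), (3, 2), (2, 3), (5, 5), (6, 5),
        (4, 4), (7, 7), (8, 7), (5, 5), (2, 2)]

def change_scores(pairs):
    return [pre - post for pre, post in pairs]   # positive = depression decreased

def mean(xs):
    return sum(xs) / len(xs)

def unbiased_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def independent_t(a, b):
    """t test for two independent samples (equal group sizes assumed)."""
    se = sqrt(unbiased_var(a) / len(a) + unbiased_var(b) / len(b))
    return (mean(a) - mean(b)) / se, len(a) + len(b) - 2

t, df = independent_t(change_scores(ect), change_scores(sham))
print(f"t = {t:.2f}, df = {df}")
```

The point of the sketch is the structure of the analysis: each subject is reduced to a single difference score, and the two groups of difference scores are then compared as independent samples.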


VIII. Additional Example Illustrating the Use of the t Test for 
Two Dependent Samples 


Example 17.7 is an additional example that can be evaluated with the t test for two dependent 
samples. Since Example 17.7 employs the same data as Example 17.1, it yields the same result. 
Note that in Example 17.7 complete counterbalancing is employed in order to control for order 
effects. 


Example 17.7 A study is conducted to evaluate the relative efficacy of two drugs (Clearoxin 
and Lesionoxin) on chronic psoriasis. Ten subjects afflicted with chronic psoriasis participate 
in the study. Each subject is exposed to both drugs for a six-month period, with a three-month 
hiatus between treatments. Five subjects are treated with Clearoxin initially, after which they 
are treated with Lesionoxin. The other five subjects are treated with Lesionoxin first and then 
with Clearoxin. The dependent variable employed in the study is a rating of the severity of a 
subject's lesions under the two drug conditions. The higher the rating the more severe a 
subject's psoriasis. The scores of the ten subjects under the two treatment conditions follow. 
(The first score represents the Clearoxin condition (which represents Condition 1), and the 
second score the Lesionoxin condition (which represents Condition 2).) Subject 1 (9, 8); 
Subject 2 (2, 2); Subject 3 (1, 3); Subject 4 (4, 2); Subject 5 (6, 3); Subject 6 (4, 0); Subject 
7 (7, 4); Subject 8 (8, 5); Subject 9 (5, 4); Subject 10 (1, 0). Do the data indicate that subjects 
respond differently to the two types of medication? 


Endnotes 


1. Alternative terms that are employed in describing a dependent samples design are re- 
peated measures design, within-subjects design, treatment-by-subjects design, corre- 
lated samples design, matched-subjects design, and randomized-blocks design. The 
use of the term blocks within the framework of a dependent samples design is discussed 
in Endnote 1 of the single-factor within-subjects analysis of variance. 


2. A study has internal validity to the extent that observed differences between the ex- 
perimental conditions on the dependent variable can be unambiguously attributed to a 
manipulated independent variable. Random assignment of subjects to the different ex- 
perimental conditions is the most effective way to insure internal validity (by eliminating 
the possible influence of confounding/extraneous variables). In contrast to internal 
validity, external validity refers to the degree to which the results of an experiment can 
be generalized. The results of an experiment can only be generalized to a population of 
subjects, as well as environmental conditions, that are comparable to those that are 
employed in the experiment. 


3. In actuality, when galvanic skin response (which is a measure of skin resistance) is meas- 
ured, the higher a subject's GSR the less emotional the subject. In Example 17.1, it is 
assumed that the GSR scores have been transformed so that the higher a subject's GSR 
score, the greater the level of emotionality. 


4. An alternative but equivalent way of writing the null hypothesis is $H_0: \mu_1 - \mu_2 = 0$. 
The analogous alternative but equivalent ways of writing the alternative hypotheses in 
the order they are presented are: $H_1: \mu_1 - \mu_2 \neq 0$; $H_1: \mu_1 - \mu_2 > 0$; and 
$H_1: \mu_1 - \mu_2 < 0$. 


5. Note that the basic structure of Equation 17.3 is the same as Equations I.8/2.1 (the 
equation for the estimated population standard deviation that is employed within the 
framework of the single-sample t test). In Equation 17.3 a standard deviation is computed 
for n D scores, whereas in Equations I.8/2.1 a standard deviation is computed for n X 
scores. 


6. The actual value that is estimated by $s_{\bar{D}}$ is $\sigma_{\bar{D}}$, which is the standard deviation of the 
sampling distribution of mean difference scores for the two populations. The meaning of 
the standard error of the mean difference can be best understood by considering the 
following procedure for generating an empirical sampling distribution of difference scores: 
a) Obtain n difference scores for a random sample of n subjects; b) Compute the mean 
difference score ($\bar{D}$) for the sample; and c) Repeat steps a) and b) m times. At the 
conclusion of this procedure one will have obtained m mean difference scores. The standard 
error of the mean difference represents the standard deviation of the m mean difference 
scores, and can be computed by substituting the term $\bar{D}$ for D (and m for n) in Equation 17.3. Thus: 


$s_{\bar{D}} = \sqrt{[\Sigma \bar{D}^2 - ((\Sigma \bar{D})^2/m)]/[m - 1]}$. The standard deviation that is computed with 
Equation 17.4 is an estimate of $\sigma_{\bar{D}}$. 
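
The procedure described in this endnote can be simulated directly. The sketch below is an illustration under stated assumptions: the population of difference scores is assumed to be normal with a mean of 0 and a standard deviation of 2 (values chosen arbitrarily), and the names `SIGMA_D`, `n`, and `m` are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)       # fixed seed so the simulation is repeatable
SIGMA_D = 2.0        # ASSUMED population standard deviation of difference scores
n, m = 10, 5000      # n difference scores per sample, m repetitions

def sample_mean_difference(n):
    """Steps a) and b): draw n difference scores and return their mean."""
    return mean(random.gauss(0.0, SIGMA_D) for _ in range(n))

# Step c): repeat m times to build the empirical sampling distribution
mean_diffs = [sample_mean_difference(n) for _ in range(m)]

empirical_se = stdev(mean_diffs)        # standard deviation of the m mean difference scores
theoretical_se = SIGMA_D / sqrt(n)      # population standard deviation divided by sqrt(n)
print(f"empirical = {empirical_se:.3f}, theoretical = {theoretical_se:.3f}")
```

With a large value of m the two values should agree closely, which illustrates why a single sample's estimated standard deviation divided by the square root of n can serve as an estimate of the standard error of the mean difference.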


7. In order for Equation 17.1 to be soluble, there must be variability in the n difference scores. 
If each subject produces the same difference score, the value of $\tilde{s}_D$ computed with Equation 
17.3 will equal 0. As a result of the latter, Equation 17.4 will yield the value $s_{\bar{D}} = 0$. 
Since $s_{\bar{D}}$ is the denominator of Equation 17.1, when the latter value equals zero, the t test 
equation will be insoluble. 


8. The numerator of Equation 17.6 will always equal $\bar{D}$ (i.e., the numerator of Equation 17.1). 
In the same respect the denominator of Equation 17.6 will always equal $s_{\bar{D}}$ (the 
denominator of Equation 17.1). The denominator of Equation 17.6 can also be written as follows: 





9. Note that in the case of Example 11.1 (which is employed to illustrate the t test for 
two independent samples), it is reasonable to assume that scores in the same row of Table 
11.1 (which summarizes the data for the study) will not be correlated with one another (by 
virtue of the fact that two independent samples are employed in the study). When in- 
dependent samples are employed, it is assumed that random factors determine the values 
of any pair of scores in the same row of a table summarizing the data, and consequently it 
is assumed that the correlation between pairs of scores in the same row will be equal to (or 
close to) 0. 


10. Due to rounding off error, there may be a slight discrepancy between the value of t com- 
puted with Equations 17.1 and 17.6. 


11. A noncorrelational procedure that allows a researcher to evaluate whether a treatment effect 
is present in the above described example is Fisher's randomization procedure (Fisher 
(1935)), which is generally categorized as a permutation test. The randomization test 
for two independent samples (Test 12a), which is an example of a test that is based on 
Fisher's randomization procedure, is described in Section IX (the Addendum) of the 
Mann-Whitney U test (Test 12). Fisher's randomization procedure requires that all 
possible score configurations which can be obtained for the value of the computed sum of 
the difference scores be determined. Upon computing the latter information, one can 
determine the likelihood of obtaining a configuration of scores that is equal to or more extreme 
than the one obtained for a set of data. 


12. Equation 17.1 can be modified as follows to be equivalent to Equation 17.8: 
$t = [\bar{D} - (\mu_1 - \mu_2)]/s_{\bar{D}}$. 


13. Although Equation 17.15 is intended for use prior to collecting the data, it should yield the 
same value for $\sigma_D$ if the values computed for the sample data are substituted in it. Thus, 
if $\sigma = 2.60$ (which is the average of the values $\tilde{s}_1 = 2.83$ and $\tilde{s}_2 = 2.38$), and $\rho_{X_1 X_2}$ (which 
is the population correlation coefficient estimated by the value $r_{X_1 X_2} = .78$) are substituted 
in Equation 17.15, the value $\sigma_D = 1.72$ is computed, which is quite close to the 
computed value $\tilde{s}_D = 1.78$. The slight discrepancy between the two values can be attributed 
to the fact that the estimated population standard deviations are not identical. 


14. In contrast to the t test for two dependent samples (which can only be employed with two 
dependent samples), the single-factor within-subjects analysis of variance can be used 
with a dependent samples design involving interval/ratio data in which there are k samples, 
where k > 2. 


15. Note that the basic structure of Equation 17.18 is the same as Equation 11.15 (which is 
employed for computing a confidence interval for the t test for two independent samples), 
except that the latter equation employs $s_{\bar{X}_1 - \bar{X}_2}$ in place of $s_{\bar{D}}$. 


16. It was noted earlier that if all n subjects obtain the identical difference score, Equations 
17.1/17.6 become unsolvable. In the case of Equation 17.19, for a given value of n, if all 
n subjects obtain the same difference score the same A value will always be computed, 
regardless of the magnitude of the identical difference score obtained by each of the n 
subjects. If the value of A computed under such conditions is substituted in the equation 
$t = \sqrt{(n - 1)/(An - 1)}$ (which is algebraically derived from Equation 17.20), the latter 
equation becomes unsolvable (since the value (An - 1) will always equal zero). The 
conclusion that results from this observation is that Equation 17.19 is insensitive to the 
magnitude of the difference between experimental conditions when all subjects obtain the 
same difference score. 
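
The behavior described in this endnote can be checked numerically. The sketch below assumes that Equation 17.19 is Sandler's A statistic in its usual form, $A = \Sigma D^2/(\Sigma D)^2$; that form is not shown in this section and is an assumption, as are the function names. Exact fractions are used so that the zero denominator is detected without floating-point error.

```python
from fractions import Fraction
from math import sqrt

def sandler_a(d):
    """ASSUMED form of Equation 17.19: sum of squared differences over squared sum."""
    return Fraction(sum(v * v for v in d), sum(d) ** 2)

def t_from_a(a, n):
    """Absolute value of t recovered from A: sqrt((n - 1) / (A*n - 1))."""
    return sqrt((n - 1) / (a * n - 1))

d_example = [0, -10, -5, -6, -4]          # difference scores, Example 17.5
a = sandler_a(d_example)
print(float(a), round(t_from_a(a, len(d_example)), 2))

# Identical difference scores give A = 1/n whatever their magnitude
print(sandler_a([3] * 5), sandler_a([7] * 5))
```

Note that the conversion yields only the absolute value of t, since the sign of the difference is lost when the scores are squared. Substituting A = 1/n into `t_from_a` raises a `ZeroDivisionError`, which corresponds to the unsolvable case described above.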





17. Equation 17.21 can also be written as follows: [equation not recoverable from the source; 
only its numerator, $\bar{X}_1 - \bar{X}_2$, survives]. 
In instances when a researcher stipulates in the null hypothesis that the difference between 
the two population means is some value other than zero, the numerator of Equation 17.21 
is the same as the numerator of Equation 17.8. The protocol for computing the value of the 
numerator is identical to that employed for Equation 17.8. 


18. In Example 17.1, the order of presentation of the conditions is controlled by randomly 
distributing the sexually explicit and neutral words throughout the 16 word list presented 
to each subject. 


19. The doctor conducting the study might feel it would be unethical to employ a group of 
comparably depressed subjects as a control group, since patients in such a group would be 
deprived of a potentially beneficial treatment. 


Test 18 


The Wilcoxon Matched-Pairs Signed-Ranks Test 
(Nonparametric Test Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Do two dependent samples represent two different popula- 
tions? 


Relevant background information on test The Wilcoxon matched-pairs signed-ranks test 
(Wilcoxon (1945, 1949)) is a nonparametric procedure employed in a hypothesis testing situation 
involving a design with two dependent samples. Whenever one or more of the assumptions of 
the t test for two dependent samples (Test 17) are saliently violated, the Wilcoxon 
matched-pairs signed-ranks test (which has less stringent assumptions) may be preferred as 
an alternative procedure. Prior to reading the material on the Wilcoxon matched-pairs signed- 
ranks test, the reader may find it useful to review the general information regarding a dependent 
samples design contained in Sections I and VII of the t test for two dependent samples. 

The Wilcoxon matched-pairs signed-ranks test is essentially an extension of the Wil- 
coxon signed-ranks test (Test 6) (which is employed for a single sample design) to a design 
involving two dependent samples. In order to employ the Wilcoxon matched-pairs signed- 
ranks test, it is required that each of n subjects (or n pairs of matched subjects) has two 
interval/ratio scores (each score having been obtained under one of the two experimental condi- 
tions). A difference score is computed for each subject (or pair of matched subjects) by 
subtracting a subject's score in Condition 2 from his score in Condition 1. The hypothesis 
evaluated with the Wilcoxon matched-pairs signed-ranks test is whether or not in the underlying 
populations represented by the samples/experimental conditions, the median of the difference 
scores (which will be represented by the notation $\theta_D$) equals zero. If a significant 
difference is obtained, it indicates there is a high likelihood the two samples/conditions represent 
two different populations. 

The Wilcoxon matched-pairs signed-ranks test is based on the following assumptions: 
a) The sample of n subjects has been randomly selected from the population it represents; b) The 
original scores obtained for each of the subjects are in the format of interval/ratio data; and c) 
The distribution of the difference scores in the populations represented by the two samples is 
symmetric about the median of the population of difference scores. 

As is the case for the t test for two dependent samples, in order for the Wilcoxon 
matched-pairs signed-ranks test to generate valid results, the following guidelines should be 
adhered to: a) To control for order effects, the presentation of the two experimental conditions 
should be random or, if appropriate, be counterbalanced; and b) If matched samples are em- 
ployed, within each pair of matched subjects each of the subjects should be randomly assigned 
to one of the two experimental conditions. 

As is the case with the t test for two dependent samples, the Wilcoxon matched-pairs 
signed-ranks test can also be employed to evaluate a before-after design. The limitations of 
the before-after design (which are discussed in Section VII of the t test for two dependent 
samples) are also applicable when it is evaluated with the Wilcoxon matched-pairs signed- 
ranks test. 

It should be noted that all of the other tests in this text that rank data (with the exception of 
the Wilcoxon signed-ranks test and the Moses test for equal variability (Test 15)), rank the 
original interval/ratio scores of subjects. The Wilcoxon matched-pairs signed-ranks test, 
however, does not rank the original interval/ratio scores, but instead ranks the interval/ratio 
difference scores of subjects (or matched pairs of subjects). For this reason, some sources 
categorize the Wilcoxon matched-pairs signed-ranks test as a test of interval/ratio data. Most 
sources, however (including this book), categorize the Wilcoxon matched-pairs signed-ranks 
test as a test of ordinal data, by virtue of the fact that a ranking procedure is part of the test 
protocol. 


II. Example 


Example 18.1 is identical to Example 17.1 (which is evaluated with the t test for two dependent
samples). In evaluating Example 18.1 it will be assumed that the ratio data are rank-ordered,
since one or more of the assumptions of the t test for two dependent samples have been
saliently violated. 


Example 18.1 A psychologist conducts a study to determine whether or not people exhibit more 
emotionality when they are exposed to sexually explicit words than when they are exposed to 
neutral words. Each of ten subjects is shown a list of 16 randomly arranged words, which are 
projected onto a screen one at a time for a period of five seconds. Eight of the words on the list 
are sexually explicit and eight of the words are neutral. As each word is projected on the screen, 
a subject is instructed to say the word softly to him or herself. As a subject does this, sensors 
attached to the palms of the subject's hands record galvanic skin response (GSR), which is used 
by the psychologist as a measure of emotionality. The psychologist computes two scores for each 
subject, one score for each of the experimental conditions: Condition 1: GSR/Explicit — The 
average GSR score for the eight sexually explicit words; Condition 2: GSR/Neutral — The 
average GSR score for the eight neutral words. The GSR/Explicit and the GSR/Neutral scores 
of the ten subjects follow. (The higher the score, the higher the level of emotionality.) Subject 
1 (9, 8); Subject 2 (2, 2); Subject 3 (1, 3); Subject 4 (4, 2); Subject 5 (6, 3); Subject 6 (4, 0); 
Subject 7 (7, 4); Subject 8 (8, 5); Subject 9 (5, 4); Subject 10 (1, 0). Do subjects exhibit
differences in emotionality with respect to the two categories of words?


III. Null versus Alternative Hypotheses


Null hypothesis $H_0: \theta_D = 0$


(In the underlying populations represented by Condition 1 and Condition 2, the median of the
difference scores equals zero. With respect to the sample data, this translates into the sum of the
ranks of the positive difference scores being equal to the sum of the ranks of the negative
difference scores (i.e., $\Sigma R+ = \Sigma R-$).)


Alternative hypothesis $H_1: \theta_D \neq 0$


(In the underlying populations represented by Condition 1 and Condition 2, the median of the
difference scores is some value other than zero. With respect to the sample data, this translates
into the sum of the ranks of the positive difference scores not being equal to the sum of the ranks
of the negative difference scores (i.e., $\Sigma R+ \neq \Sigma R-$). This is a nondirectional alternative
hypothesis and it is evaluated with a two-tailed test.)


or

$H_1: \theta_D > 0$


(In the underlying populations represented by Condition 1 and Condition 2, the median of the
difference scores is some value that is greater than zero. With respect to the sample data, this
translates into the sum of the ranks of the positive difference scores being greater than the sum
of the ranks of the negative difference scores (i.e., $\Sigma R+ > \Sigma R-$). The latter result indicates that
the scores in Condition 1 are higher than the scores in Condition 2. This is a directional
alternative hypothesis and it is evaluated with a one-tailed test.)


or

$H_1: \theta_D < 0$


(In the underlying populations represented by Condition 1 and Condition 2, the median of the
difference scores is some value that is less than zero (i.e., a negative number). With respect to
the sample data, this translates into the sum of the ranks of the positive difference scores being
less than the sum of the ranks of the negative difference scores (i.e., $\Sigma R+ < \Sigma R-$). The latter
result indicates that the scores in Condition 2 are higher than the scores in Condition 1. This is
a directional alternative hypothesis and it is evaluated with a one-tailed test.)


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 


The data for Example 18.1 are summarized in Table 18.1. Note that there are 10 subjects and 
that each subject has two scores. 


Table 18.1 Data for Example 18.1


Subject    X_1    X_2    D = X_1 - X_2    Rank of |D|    Signed rank of |D|
   1        9      8           1              2                 2
   2        2      2           0              -                 -
   3        1      3          -2              4.5              -4.5
   4        4      2           2              4.5               4.5
   5        6      3           3              7                 7
   6        4      0           4              9                 9
   7        7      4           3              7                 7
   8        8      5           3              7                 7
   9        5      4           1              2                 2
  10        1      0           1              2                 2

                                                     $\Sigma R+ = 40.5$
                                                     $\Sigma R- = 4.5$


In Table 18.1, $X_1$ represents each subject's score in Condition 1 (sexually explicit words)
and $X_2$ represents each subject's score in Condition 2 (neutral words). In Column 4 of Table
18.1 a D score is computed for each subject by subtracting a subject's score in Condition 2
from the subject's score in Condition 1 (i.e., $D = X_1 - X_2$). In Column 5 the D scores have
been ranked with respect to their absolute values. Since the ranking protocol employed for the
Wilcoxon matched-pairs signed-ranks test is identical to that employed for the Wilcoxon 
signed-ranks test, the reader may find it useful to review the ranking protocol described in 
Section IV of the latter test. To reiterate, the following guidelines should be adhered to when 
ranking the difference scores for the Wilcoxon matched-pairs signed-ranks test. 

a) The absolute values of the difference scores (|D|) are ranked (i.e., the sign of a dif- 
ference score is not taken into account). 

b) Any difference score that equals zero is not ranked. This translates into eliminating from 
the analysis any subject who yields a difference score of zero. 

c) When there are tied scores present in the data, the average of the ranks involved is 
assigned to all scores tied for a given rank. 

d) As is the case with the Wilcoxon signed-ranks test, when ranking difference scores for 
the Wilcoxon matched-pairs signed-ranks test it is essential that a rank of 1 be assigned to the 
difference score with the lowest absolute value, and that a rank of n be assigned to the difference 
score with the highest absolute value (where n represents the number of signed ranks — i.e., 
difference scores that have been ranked).²

Upon ranking the absolute values of the difference scores, the sign of each difference score 
is placed in front of its rank. The signed ranks of the difference scores are listed in Column 6 of 
Table 18.1. Note that although 10 subjects participated in the experiment there are only n = 9 
signed ranks, since Subject 2 had a difference score of zero which was not ranked. Table 18.2 
summarizes the rankings of the difference scores for Example 18.1. 


Table 18.2 Ranking Procedure for Wilcoxon Matched-Pairs Signed-Ranks Test


Subject number                        2    1    9   10    3     4     5    7    8    6
Subject's difference score            0    1    1    1   -2     2     3    3    3    4
Absolute value of difference score    -    1    1    1    2     2     3    3    3    4
Rank of |D|                           -    2    2    2    4.5   4.5   7    7    7    9


The sum of the ranks that have a positive sign (i.e., $\Sigma R+ = 40.5$) and the sum of the ranks
that have a negative sign (i.e., $\Sigma R- = 4.5$) are recorded at the bottom of Column 6 in Table 18.1.
Equation 18.1 (which is identical to Equation 6.1) allows one to check the accuracy of these 
values. If the relationship indicated by Equation 18.1 is not obtained, it indicates an error has 
been made in the calculations. 


$$\Sigma R+ + \Sigma R- = \frac{n(n + 1)}{2} \qquad \text{(Equation 18.1)}$$


Employing the values $\Sigma R+ = 40.5$ and $\Sigma R- = 4.5$ in Equation 18.1, we confirm that the
relationship described by the equation is true.


$$40.5 + 4.5 = \frac{(9)(10)}{2} = 45$$
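
The ranking protocol of this section and the check provided by Equation 18.1 can be sketched in a few lines of Python. The sketch is an illustration only and is not part of the original analysis: the function name signed_ranks and all variable names are assumptions introduced here, and only the standard library is used.

```python
# Scores from Example 18.1: (Condition 1, Condition 2) for Subjects 1-10.
scores = [(9, 8), (2, 2), (1, 3), (4, 2), (6, 3),
          (4, 0), (7, 4), (8, 5), (5, 4), (1, 0)]


def signed_ranks(pairs):
    """Return the signed ranks of the nonzero difference scores.

    Hypothetical helper: zero differences are dropped, absolute values are
    ranked from smallest to largest, tied values receive the average of the
    ranks involved, and the sign of each difference is attached to its rank.
    """
    diffs = [x1 - x2 for x1, x2 in pairs if x1 != x2]
    ordered = sorted(abs(d) for d in diffs)
    # Average ordinal position (1-based) of each distinct absolute value.
    avg_rank = {}
    for value in set(ordered):
        positions = [i + 1 for i, v in enumerate(ordered) if v == value]
        avg_rank[value] = sum(positions) / len(positions)
    return [avg_rank[abs(d)] if d > 0 else -avg_rank[abs(d)] for d in diffs]


ranks = signed_ranks(scores)
sum_pos = sum(r for r in ranks if r > 0)    # sum of the positive ranks
sum_neg = -sum(r for r in ranks if r < 0)   # sum of the negative ranks (magnitude)
n = len(ranks)                              # number of signed ranks

print(ranks)             # signed ranks in subject order (Subject 2 omitted)
print(sum_pos, sum_neg)  # 40.5 4.5
# Equation 18.1: the two sums must add up to n(n + 1)/2.
assert sum_pos + sum_neg == n * (n + 1) / 2
```

If the final assertion fails for a set of data, an error has been made in the ranking, which is the same use to which Equation 18.1 is put in the text.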


V. Interpretation of the Test Results 


As noted in Section III, if the sample is derived from a population in which the median of the
difference scores equals zero, the values of $\Sigma R+$ and $\Sigma R-$ will be equal to one another. When
$\Sigma R+$ and $\Sigma R-$ are equivalent, both of these values will equal [n(n + 1)]/4, which in the case of
Example 18.1 will be [(9)(10)]/4 = 22.5. This latter value is commonly referred to as the
expected value of the Wilcoxon T statistic.

If the value of $\Sigma R+$ is significantly greater than the value of $\Sigma R-$, it indicates there is a
high likelihood that Condition 1 represents a population with higher scores than the population
represented by Condition 2. On the other hand, if $\Sigma R-$ is significantly greater than $\Sigma R+$, it
indicates there is a high likelihood that Condition 2 represents a population with higher scores
than the population represented by Condition 1. Table 18.1 reveals that $\Sigma R+ = 40.5$ is greater
than $\Sigma R- = 4.5$, and thus the data are consistent with the directional alternative hypothesis
$H_1: \theta_D > 0$ (i.e., it indicates that subjects obtained higher scores in Condition 1 than Condition
2). The question is, however, whether the difference is significant — i.e., whether it is large 
enough to conclude that it is unlikely to be the result of chance. 

The absolute value of the smaller of the two values $\Sigma R+$ versus $\Sigma R-$ is designated as the
Wilcoxon T test statistic. Since $\Sigma R- = 4.5$ is smaller than $\Sigma R+ = 40.5$, T = 4.5. The T value
is interpreted by employing Table A5 (Table of Critical T Values for Wilcoxon's Signed-Ranks
and Matched-Pairs Signed-Ranks Tests) in the Appendix. Table A5 lists the critical
two-tailed and one-tailed .05 and .01 T values in relation to the number of signed ranks in a set
of data. In order to be significant, the obtained value of T must be equal to or less than the
tabled critical T value at the prespecified level of significance.³ Table 18.3 summarizes the tabled
critical two-tailed and one-tailed .05 and .01 Wilcoxon T values for n = 9 signed ranks.


Table 18.3 Tabled Critical Wilcoxon T Values for n = 9 Signed Ranks


                       $T_{.05}$    $T_{.01}$
Two-tailed values          5            1
One-tailed values          8            3

Since the null hypothesis can only be rejected if the computed value T = 4.5 is equal to or 
less than the tabled critical value at the prespecified level of significance, we can conclude the 
following. 

In order for the nondirectional alternative hypothesis $H_1: \theta_D \neq 0$ to be supported, it
is irrelevant whether $\Sigma R+ > \Sigma R-$ or $\Sigma R- > \Sigma R+$. In order for the result to be significant,
the computed value of T must be equal to or less than the tabled critical two-tailed value at
the prespecified level of significance. Since the computed value T = 4.5 is less than the tabled
critical two-tailed .05 value $T_{.05} = 5$, the nondirectional alternative hypothesis $H_1: \theta_D \neq 0$ is
supported at the .05 level. It is not, however, supported at the .01 level, since T = 4.5 is greater
than the tabled critical two-tailed .01 value $T_{.01} = 1$.

In order for the directional alternative hypothesis $H_1: \theta_D > 0$ to be supported, $\Sigma R+$ must
be greater than $\Sigma R-$. Since $\Sigma R+ > \Sigma R-$, the data are consistent with the directional alternative
hypothesis $H_1: \theta_D > 0$. In order for the result to be significant, the computed value of T must
be equal to or less than the tabled critical one-tailed value at the prespecified level of
significance. Since the computed value T = 4.5 is less than the tabled critical one-tailed .05 value
$T_{.05} = 8$, the directional alternative hypothesis $H_1: \theta_D > 0$ is supported at the .05 level. It is
not, however, supported at the .01 level, since T = 4.5 is greater than the tabled critical one-tailed
.01 value $T_{.01} = 3$.

In order for the directional alternative hypothesis $H_1: \theta_D < 0$ to be supported, the following
two conditions must be met: a) $\Sigma R-$ must be greater than $\Sigma R+$; and b) The computed value
of T must be equal to or less than the tabled critical one-tailed value at the prespecified level of
significance. Since the first of these conditions is not met, the directional alternative hypothesis
$H_1: \theta_D < 0$ is not supported.

A summary of the analysis of Example 18.1 with the Wilcoxon matched-pairs signed-ranks
test follows: It can be concluded that subjects exhibited higher GSR (emotionality) scores
with respect to the sexually explicit words than the neutral words.
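
The decision rules applied above can be summarized in a short Python sketch. The critical values are those listed in Table 18.3 for n = 9 signed ranks. The function name supported and the labels used for the three alternative hypotheses are assumptions introduced here for illustration.

```python
# Signed-rank sums for Example 18.1 (from Table 18.1).
sum_pos, sum_neg = 40.5, 4.5

# Tabled critical T values for n = 9 signed ranks (from Table 18.3).
critical_t = {
    ("two-tailed", .05): 5, ("two-tailed", .01): 1,
    ("one-tailed", .05): 8, ("one-tailed", .01): 3,
}

# Wilcoxon T is the smaller of the two sums.
t_stat = min(sum_pos, sum_neg)


def supported(alternative, alpha):
    """Hypothetical helper: is the stated alternative hypothesis supported?

    alternative is "nondirectional", "greater" (median difference > 0),
    or "less" (median difference < 0).
    """
    if alternative == "nondirectional":
        return t_stat <= critical_t[("two-tailed", alpha)]
    # A directional hypothesis also requires data in the predicted direction.
    consistent = sum_pos > sum_neg if alternative == "greater" else sum_neg > sum_pos
    return consistent and t_stat <= critical_t[("one-tailed", alpha)]


for alt in ("nondirectional", "greater", "less"):
    print(alt, supported(alt, .05), supported(alt, .01))
```

For Example 18.1 the loop reports that the nondirectional hypothesis and the hypothesis that the median difference is greater than zero are supported at the .05 level only, and that the hypothesis that the median difference is less than zero is not supported.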

The results obtained with the Wilcoxon matched-pairs signed-ranks test are reasonably 
consistent with those obtained when the t test for two dependent samples is employed to evaluate
the same set of data. In the case of both tests, the analogous nondirectional alternative
hypotheses $H_1: \theta_D \neq 0$ and $H_1: \mu_1 \neq \mu_2$ are supported, but only at the .05 level. In the case
of the Wilcoxon matched-pairs signed-ranks test, the directional alternative hypothesis
$H_1: \theta_D > 0$ is only supported at the .05 level, whereas the analogous directional alternative
hypothesis $H_1: \mu_1 > \mu_2$ is supported at both the .05 and .01 levels when the data are evaluated
with the t test for two dependent samples. The latter discrepancy between the two tests reflects
the fact that when a parametric and nonparametric test are applied to the same set of data, the 
parametric test will generally provide a more powerful test of an alternative hypothesis. In most 
instances, however, similar conclusions will be reached if the same data are evaluated with the 
t test for two dependent samples and the Wilcoxon matched-pairs signed-ranks test. 


VI. Additional Analytical Procedures for the Wilcoxon Matched-Pairs
Signed-Ranks Test and/or Related Tests


1. The normal approximation of the Wilcoxon T statistic for large sample sizes As is the 
case with the Wilcoxon signed-ranks test, if the sample size employed in a study is relatively 
large, the normal distribution can be employed to approximate the Wilcoxon T statistic. 
Although sources do not agree on the value of the sample size that justifies employing the normal 
approximation of the Wilcoxon distribution, they generally state that it should be employed for 
sample sizes larger than those documented in the Wilcoxon table contained within the source. 
Equation 18.2 (which is identical to Equation 6.2) provides the normal approximation for
Wilcoxon T. In the equation T represents the computed value of Wilcoxon T, which for Example
18.1 is T = 4.5. n, as noted previously, represents the number of signed ranks. Thus, in our
example, n = 9. Note that in the numerator of Equation 18.2, the term [n(n + 1)]/4 represents the
expected value of T (often summarized with the symbol $T_E$), which is defined in Section V. The
denominator of Equation 18.2 represents the expected standard deviation of the sampling 
distribution of the T statistic. 





$$z = \frac{T - \frac{n(n + 1)}{4}}{\sqrt{\frac{n(n + 1)(2n + 1)}{24}}} \qquad \text{(Equation 18.2)}$$


Although Example 18.1 involves only nine signed ranks (a value most sources would view 
as too small to use with the normal approximation), it will be employed to illustrate Equation 
18.2. The reader will see that in spite of employing Equation 18.2 with a small sample size, it 
will yield essentially the same result as that obtained when the exact table of the Wilcoxon 
distribution is employed. When the values T = 4.5 and n = 9 are substituted in Equation 18.2, 
the value z = -2.13 is computed.


$$z = \frac{4.5 - \frac{(9)(10)}{4}}{\sqrt{\frac{(9)(10)(19)}{24}}} = -2.13$$

The obtained value z = -2.13 is evaluated with Table A1 (Table of the Normal Distribution)
in the Appendix. In Table A1 the tabled critical two-tailed .05 and .01 values are $z_{.05} = 1.96$
and $z_{.01} = 2.58$, and the tabled critical one-tailed .05 and .01 values are $z_{.05} = 1.65$ and
$z_{.01} = 2.33$.

Since the smaller of the two values $\Sigma R+$ versus $\Sigma R-$ is selected to represent T, the value
of z computed with Equation 18.2 will always be a negative number (unless $\Sigma R+ = \Sigma R-$, in
which case z will equal zero). This is the case, since by selecting the smaller value T will always
be less than the expected value $T_E$. As a result of this, the following guidelines are employed
in evaluating the null hypothesis.

a) If a nondirectional alternative hypothesis is employed, the null hypothesis can be re- 
jected if the obtained absolute value of z is equal to or greater than the tabled critical two-tailed 
value at the prespecified level of significance. 

b) When a directional alternative hypothesis is employed, one of the two possible 
directional alternative hypotheses will be supported if the obtained absolute value of z is equal 
to or greater than the tabled critical one-tailed value at the prespecified level of significance. 
Which alternative hypothesis is supported depends on the prediction regarding which of the two 
values $\Sigma R+$ versus $\Sigma R-$ is larger. The null hypothesis can only be rejected if the directional
alternative hypothesis that is consistent with the data is supported. 

Employing the above guidelines, when the normal approximation is employed with 
Example 18.1 the following conclusions can be reached. 

The nondirectional alternative hypothesis $H_1: \theta_D \neq 0$ is supported at the .05 level. This
is the case, since the computed absolute value z = 2.13 is greater than the tabled critical two-tailed
.05 value $z_{.05} = 1.96$. The nondirectional alternative hypothesis $H_1: \theta_D \neq 0$ is not
supported at the .01 level, since the absolute value z = 2.13 is less than the tabled critical two-tailed
.01 value $z_{.01} = 2.58$. This decision is consistent with the decision that is reached when
the exact table of the Wilcoxon distribution is employed to evaluate the nondirectional alternative
hypothesis $H_1: \theta_D \neq 0$.

The directional alternative hypothesis $H_1: \theta_D > 0$ is supported at the .05 level. This is
the case, since the data are consistent with the latter alternative hypothesis (i.e., $\Sigma R+ > \Sigma R-$), and
the computed absolute value z = 2.13 is greater than the tabled critical one-tailed .05 value
$z_{.05} = 1.65$. The directional alternative hypothesis $H_1: \theta_D > 0$ is not supported at the .01
level, since the obtained absolute value z = 2.13 is less than the tabled critical one-tailed .01 value
$z_{.01} = 2.33$. This decision is consistent with the decision that is reached when the exact table
of the Wilcoxon distribution is employed to evaluate the directional alternative hypothesis
$H_1: \theta_D > 0$.

The directional alternative hypothesis $H_1: \theta_D < 0$ is not supported, since the data are
not consistent with the latter alternative hypothesis (which requires that $\Sigma R- > \Sigma R+$).

It should be noted that, in actuality, either $\Sigma R+$ or $\Sigma R-$ can be employed to represent the
value of T in Equation 18.2. Either value will yield the same absolute value for z. The smaller
of the two values will always yield a negative z value, and the larger of the two values will
always yield a positive z value (which in this instance will be z = 2.13 if $\Sigma R+ = 40.5$ is employed
to represent T). In evaluating a nondirectional alternative hypothesis the sign of z is irrelevant.
In the case of a directional alternative hypothesis, one must determine whether the data are 
consistent with the alternative hypothesis that is stipulated. If the data are consistent, one then 
determines whether the absolute value of z is equal to or greater than the tabled critical one-tailed 
value at the prespecified level of significance. 
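
As a check on the arithmetic, Equation 18.2 can be evaluated with a brief Python sketch. The function name wilcoxon_z is an assumption introduced here for illustration, and the critical values are those cited above from Table A1.

```python
import math

t_stat = 4.5   # Wilcoxon T for Example 18.1 (the smaller of the two rank sums)
n = 9          # number of signed ranks


def wilcoxon_z(t, n):
    """Normal approximation of Wilcoxon T (Equation 18.2)."""
    expected = n * (n + 1) / 4                            # expected value of T
    std_dev = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)   # standard deviation of T
    return (t - expected) / std_dev


z = wilcoxon_z(t_stat, n)
print(round(z, 2))                     # -2.13
print(round(wilcoxon_z(40.5, n), 2))   # 2.13 when the larger sum represents T

# Compare the absolute value of z with the tabled critical values.
print(abs(z) >= 1.96, abs(z) >= 2.58)  # two-tailed .05 and .01
print(abs(z) >= 1.65, abs(z) >= 2.33)  # one-tailed .05 and .01
```

The second call illustrates the point made in the last paragraph: substituting the larger rank sum changes only the sign of z, not its absolute value.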


2. The correction for continuity for the normal approximation of the Wilcoxon matched- 
pairs signed-ranks test As noted in the discussion of the Wilcoxon signed-ranks test, a 
correction for continuity can be employed for the normal approximation of the Wilcoxon test
statistic. The same correction for continuity can be applied to the Wilcoxon matched-pairs 
signed-ranks test. The correction for continuity (which results in a slight reduction in the 
absolute value computed for z) requires that .5 be subtracted from the absolute value of the 
numerator of Equation 18.2. Thus, Equation 18.3 (which is identical to Equation 6.3) represents 
the continuity-corrected normal approximation of the Wilcoxon test statistic. 








$$z = \frac{\left| T - \frac{n(n + 1)}{4} \right| - .5}{\sqrt{\frac{n(n + 1)(2n + 1)}{24}}} \qquad \text{(Equation 18.3)}$$


Employing Equation 18.3, the continuity-corrected value z = 2.07 is computed. Note that 
as a result of the absolute value conversion, the numerator of Equation 18.3 will always be a 
positive number, thus yielding a positive z value. 


$$z = \frac{\left| 4.5 - \frac{(9)(10)}{4} \right| - .5}{\sqrt{\frac{(9)(10)(19)}{24}}} = 2.07$$


The result of the analysis with Equation 18.3 leads to the same conclusions that are reached 
with Equation 18.2 (i.e., when the correction for continuity is not employed). Specifically, since 
the absolute value z = 2.07 is greater than the tabled critical two-tailed .05 value $z_{.05} = 1.96$,
the nondirectional alternative hypothesis $H_1: \theta_D \neq 0$ is supported at the .05 level (but not at
the .01 level). Since the absolute value z = 2.07 is greater than the tabled critical one-tailed .05
value $z_{.05} = 1.65$, the directional alternative hypothesis $H_1: \theta_D > 0$ is supported at the .05
level (but not at the .01 level). 
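
The effect of the correction for continuity can be verified with the following Python sketch of Equation 18.3. The variable names are assumptions introduced here for illustration.

```python
import math

t_stat = 4.5   # Wilcoxon T for Example 18.1
n = 9          # number of signed ranks

expected = n * (n + 1) / 4                            # expected value of T
std_dev = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)   # standard deviation of T

# Equation 18.3: subtract .5 from the absolute value of the numerator.
z_corrected = (abs(t_stat - expected) - 0.5) / std_dev
# Absolute value of z without the correction (Equation 18.2), for comparison.
z_uncorrected = abs(t_stat - expected) / std_dev

print(round(z_corrected, 2))    # 2.07
print(round(z_uncorrected, 2))  # 2.13
```

The corrected value is slightly smaller than the uncorrected value, which is consistent with the statement that the correction results in a slight reduction in the absolute value of z.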


3. Tie correction for the normal approximation of the Wilcoxon test statistic Equation 18.4 
(which is identical to Equation 6.4) is an adjusted version of Equation 18.2 that is recommended 
in some sources (e.g., Daniel (1990) and Marascuilo and McSweeney (1977)) when tied dif- 
ference scores are present in the data. The tie correction (which is identical to the one described 
for the Wilcoxon signed-ranks test) results in a slight increase in the absolute value of z. Unless 
there are a substantial number of ties, the difference between the values of z computed with 
Equations 18.2 and 18.4 will be minimal.








$$z = \frac{T - \frac{n(n + 1)}{4}}{\sqrt{\frac{n(n + 1)(2n + 1)}{24} - \frac{\Sigma t^3 - \Sigma t}{48}}} \qquad \text{(Equation 18.4)}$$


Table 18.4 illustrates the application of the tie correction with Example 18.1. In the data 
for Example 18.1 there are three sets of tied ranks: Set 1 involves three subjects (Subjects 1, 9, 
and 10); Set 2 involves two subjects (Subjects 3 and 4); Set 3 involves three subjects (Subjects 
5, 7, and 8). The number of subjects involved in each set of tied ranks represents the values of 
t in the third column of Table 18.4. The three t values are cubed in the last column of the table,
after which the values $\Sigma t$ and $\Sigma t^3$ are computed. The appropriate values are now substituted
in Equation 18.4.⁵


Table 18.4 Correction for Ties with Normal Approximation


Subject    Rank    t    t^3
   1        2
   9        2      3    27
  10        2
   3        4.5
   4        4.5    2     8
   5        7
   7        7      3    27
   8        7
   6        9

            $\Sigma t = 8$    $\Sigma t^3 = 62$





$$z = \frac{4.5 - \frac{(9)(10)}{4}}{\sqrt{\frac{(9)(10)(19)}{24} - \frac{62 - 8}{48}}} = -2.15$$


The absolute value z = 2.15 is slightly larger than the absolute value z = 2.13 obtained 
without the tie correction. The difference between the two methods is trivial, and in this instance, 
regardless of which alternative hypothesis is employed, the decision the researcher makes with 
respect to the null hypothesis is not affected. 
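
The tie correction can likewise be checked with a short Python sketch of Equation 18.4. The ranks are those listed in Table 18.1. The variable names, and the use of collections.Counter to count the scores in each set of tied ranks, are assumptions introduced here for illustration.

```python
import math
from collections import Counter

# Ranks of |D| for the nine nonzero difference scores (Table 18.1).
ranks = [2, 4.5, 4.5, 7, 9, 7, 7, 2, 2]
t_stat = 4.5       # Wilcoxon T for Example 18.1
n = len(ranks)     # number of signed ranks

# t = number of scores in each set of tied ranks (a rank held by one score is not a tie).
tie_sizes = [count for count in Counter(ranks).values() if count > 1]
sum_t = sum(tie_sizes)
sum_t3 = sum(t ** 3 for t in tie_sizes)

expected = n * (n + 1) / 4
# Equation 18.4: the tie term is subtracted inside the radical.
variance = n * (n + 1) * (2 * n + 1) / 24 - (sum_t3 - sum_t) / 48
z_tie = (t_stat - expected) / math.sqrt(variance)

print(sum_t, sum_t3)     # 8 62
print(round(z_tie, 2))   # -2.15
```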

Conover (1980, 1999) and Daniel (1990) discuss and/or cite sources on the subject of 
alternative ways of handling tied difference scores. Conover (1980, 1999) also notes that in 
some instances retaining and ranking zero difference scores may actually provide a more 
powerful test of an alternative hypothesis than the more conventional method employed in this 
book (which eliminates zero difference scores from the data). 


4. Sources for computing a confidence interval for the Wilcoxon matched-pairs signed-ranks
test Conover (1980, 1999), Daniel (1990), and Marascuilo and McSweeney (1977) describe
procedures for computing a confidence interval for the Wilcoxon matched-pairs signed-ranks
test, i.e., computing a range of values within which a researcher can be confident to a
specified degree (or that the probability is) that a difference between two population medians 
falls. 


VII. Additional Discussion of the Wilcoxon Matched-Pairs 
Signed-Ranks Test 


1. Power-efficiency of the Wilcoxon matched-pairs signed-ranks test When the underlying 
population distributions are normal, the asymptotic relative efficiency (which is discussed in 
Section VII of the Wilcoxon signed-ranks test) of the Wilcoxon matched-pairs signed-ranks 
test is .955 (when contrasted with the t test for two dependent samples). For population
distributions that are not normal, the asymptotic relative efficiency of the Wilcoxon matched-pairs
signed-ranks test is generally equal to or greater than 1. As a general rule, proponents of
nonparametric tests take the position that when a researcher has reason to believe that the
normality assumption of the t test for two dependent samples has been saliently violated, the
Wilcoxon matched-pairs signed-ranks test provides a powerful test of the comparable alter- 
native hypothesis. 


2. Alternative nonparametric procedures for evaluating a design involving two dependent 
samples In addition to the Wilcoxon matched-pairs signed-ranks test, the binomial sign test
for two dependent samples (Test 19) (which is described in the next chapter) can be employed 
to evaluate a design involving two dependent samples. Marascuilo and McSweeney (1977) de- 
scribe the extension of the van der Waerden normal-scores test for k independent samples 
(Test 23) (Van der Waerden (1953/1953)) (which is discussed later in the book) to a design
involving k dependent samples. Normal-scores tests are procedures which involve transformation 
of ordinal data through use of the normal distribution. Conover (1980, 1999) notes that a normal- 
scores test developed by Bell and Doksum (1965) can be extended to a dependent samples design. 
Another procedure that can be employed with a dependent samples design is Fisher's random- 
ization procedure (Fisher (1935)) (which is described in Conover (1980, 1999), Marascuilo and 
McSweeney (1977) and Siegel and Castellan (1988)). The randomization test for two inde- 
pendent samples (Test 12a) (which is described in Section IX (the Addendum) of the Mann- 
Whitney U test (Test 12)) illustrates the use of Fisher's randomization procedure with two 
independent samples. Additional nonparametric procedures that can be employed with a k 
dependent samples design are either discussed or referenced in Conover (1980, 1999), Daniel 
(1990), Hollander and Wolfe (1999), Marascuilo and McSweeney (1977), and Sheskin (1984). 


VIII. Additional Examples Illustrating the Use of the Wilcoxon 
Matched-Pairs Signed-Ranks Test 


The Wilcoxon matched-pairs signed-ranks test can be employed to evaluate any of the additional
examples noted for the t test for two dependent samples (i.e., Examples 17.2-17.7). In
all instances in which the Wilcoxon matched-pairs signed-ranks test is employed, difference 
scores are obtained for subjects (or pairs of matched subjects). All difference scores are then 
ranked and evaluated in accordance with the ranking protocol described in Section IV. 


Endnotes


1. Some sources note that one assumption of the Wilcoxon matched-pairs signed-ranks test 
is that the variable being measured is based on a continuous distribution. In practice, how- 
ever, this assumption is often not adhered to. 


2. When there are tied scores for either the lowest or highest difference scores, as a result of 
averaging the ordinal positions of the tied scores, the rank assigned to the lowest difference 
score will be some value greater than 1, and the rank assigned to the highest difference score 
will be some value less than n. 


3. A more thorough discussion of Table A5 can be found in Section V of the Wilcoxon signed- 
ranks test. 


4. The concept of power efficiency is discussed in Section VII of the Mann-Whitney U test. 


5. The term $(\Sigma t^3 - \Sigma t)$ in Equation 18.4 can also be written as $\sum_{i=1}^{s} (t_i^3 - t_i)$. The latter
notation indicates the following: a) For each set of ties, the number of ties in the set is
subtracted from the cube of the number of ties in that set; and b) the sum of all the values
computed in part a) is obtained. Thus, in the example under discussion (in which there are
s = 3 sets of ties):


$$\sum_{i=1}^{s} (t_i^3 - t_i) = [(3)^3 - 3] + [(2)^3 - 2] + [(3)^3 - 3] = 54$$


The computed value of 54 is the same as the corresponding value $(\Sigma t^3 - \Sigma t) = 62 - 8 = 54$
computed in Equation 18.4 through use of Table 18.4.


6. A correction for continuity can be used in conjunction with the tie correction by subtracting
.5 from the absolute value computed for the numerator of Equation 18.4. Use of the correction
for continuity will reduce the tie corrected absolute value of z.


Test 19


The Binomial Sign Test for Two Dependent Samples 
(Nonparametric Test Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Do two dependent samples represent two different popula- 
tions? 


Relevant background information on test The binomial sign test for two dependent 
samples is essentially an extension of the binomial sign test for a single sample (Test 9) to a 
design involving two dependent samples. Since a complete discussion of the binomial 
distribution (which is the distribution upon which the test is based) is contained in the discussion 
of the binomial sign test for a single sample, the reader is advised to read the material on the 
latter test prior to continuing this section. Whenever one or more of the assumptions of the t test 
for two dependent samples (Test 17) or the Wilcoxon matched-pairs signed-ranks test (Test 
18) are saliently violated, the binomial sign test for two dependent samples can be employed
as an alternative procedure. The reader should review the assumptions of the aforementioned
tests, as well as the information on a dependent samples design discussed in Sections I and VII
of the t test for two dependent samples.

To employ the binomial sign test for two dependent samples, it is required that each of 
n subjects (or n pairs of matched subjects) has two scores (each score having been obtained under 
one of the two experimental conditions). The two scores are represented by the notations $X_1$
and $X_2$. For each subject (or pair of matched subjects), a determination is made with respect
to whether a subject obtains a higher score in Condition 1 or Condition 2. Based on the latter,
a signed difference (D+ or D-) is assigned to each pair of scores. The sign of the difference
assigned to a pair of scores will be positive if a higher score is obtained in Condition 1 (i.e., D+
if $X_1 > X_2$), whereas the sign of the difference will be negative if a higher score is obtained
in Condition 2 (i.e., D- if $X_2 > X_1$). The hypothesis the binomial sign test for two dependent
samples evaluates is whether or not in the underlying population represented by the sample,
the proportion of subjects who obtain a positive signed difference (i.e., obtain a higher score in
Condition 1) is some value other than .5. If the proportion of subjects who obtain a positive
signed difference (which, for the underlying population, is represented by the notation $\pi+$) is
some value that is either significantly above or below .5, it indicates there is a high likelihood the 
two dependent samples represent two different populations. 
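
The assignment of signed differences can be illustrated with a short Python sketch. It reuses the ten pairs of scores listed in Example 18.1. Leaving a pair with two equal scores without a sign, and excluding it from the proportion, is an assumption made here for illustration, as are all variable names.

```python
# Scores from Example 18.1: (Condition 1, Condition 2) for Subjects 1-10.
scores = [(9, 8), (2, 2), (1, 3), (4, 2), (6, 3),
          (4, 0), (7, 4), (8, 5), (5, 4), (1, 0)]

signs = []
for x1, x2 in scores:
    if x1 > x2:
        signs.append("+")   # D+: higher score in Condition 1
    elif x2 > x1:
        signs.append("-")   # D-: higher score in Condition 2
    # Equal scores receive no sign (assumption: such pairs are not counted).

n_pos = signs.count("+")
n_neg = signs.count("-")
# Sample proportion of positive signed differences among the signed pairs.
prop_pos = n_pos / (n_pos + n_neg)

print(n_pos, n_neg)   # 8 1
```

The sample proportion of positive signed differences is the quantity the test compares with the value .5 stipulated for the underlying population.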

The binomial sign test for two dependent samples is based on the following assumptions:¹
a) The sample of n subjects has been randomly selected from the population it represents; and b) 
The format of the data is such that within each pair of scores the two scores can be rank-ordered. 

As is the case for the t test for two dependent samples and the Wilcoxon matched-pairs 
signed-ranks test, in order for the binomial sign test for two dependent samples to generate 
valid results, the following guidelines should be adhered to: a) To control for order effects, 
the presentation of the two experimental conditions should be random or, if appropriate, be 
counterbalanced; and b) If matched samples are employed, within each pair of matched subjects
each of the subjects should be randomly assigned to one of the two experimental conditions. 

As is the case with the t test for two dependent samples and the Wilcoxon matched-pairs 
signed-ranks test, the binomial sign test for two dependent samples can also be employed to 
evaluate a before-after design. The limitations of the before-after design (which are discussed 
in Section VII of the t test for two dependent samples) are also applicable when it is evaluated
with the binomial sign test for two dependent samples. 


II. Example 


Example 19.1 is identical to Examples 17.1 and 18.1 (which are, respectively, evaluated with the 
t test for two dependent samples and the Wilcoxon matched-pairs signed-ranks test). In 
evaluating Example 19.1 it will be assumed that the binomial sign test for two dependent 
samples is employed, since one or more of the assumptions of the t test for two dependent
samples and the Wilcoxon matched-pairs signed-ranks test have been saliently violated. 


Example 19.1 A psychologist conducts a study to determine whether or not people exhibit 
more emotionality when they are exposed to sexually explicit words than when they are exposed 
to neutral words. Each of ten subjects is shown a list of 16 randomly arranged words which are 
projected onto a screen one at a time for a period of five seconds. Eight of the words on the list 
are sexually explicit in nature and eight of the words are neutral. As each word is projected on 
the screen, a subject is instructed to say the word softly to him or herself. As a subject does this, 
sensors attached to the palms of the subject’s hands record galvanic skin response (GSR), which 
is used by the psychologist as a measure of emotionality. The psychologist computes two scores 
for each subject, one score for each of the experimental conditions: Condition 1: GSR/Explicit 
— The average GSR score for the eight sexually explicit words; Condition 2: GSR/Neutral — 
The average GSR score for the eight neutral words. The GSR/Explicit and the GSR/Neutral 
scores of the ten subjects follow. (The higher the score, the higher the level of emotionality.) 
Subject 1 (9, 8); Subject 2 (2, 2); Subject 3 (1, 3); Subject 4 (4, 2); Subject 5 (6, 3); Subject 
6 (4, 0); Subject 7 (7, 4); Subject 8 (8, 5); Subject 9 (5, 4); Subject 10 (1, 0). Do subjects 
exhibit differences in emotionality with respect to the two categories of words? 


III. Null versus Alternative Hypotheses 


Null hypothesis            H₀: π+ = .5


(In the underlying population the sample represents, the proportion of subjects who obtain a
positive signed difference (i.e., a higher score in Condition 1 than Condition 2) equals .5.)


Alternative hypothesis     H₁: π+ ≠ .5


(In the underlying population the sample represents, the proportion of subjects who obtain a
positive signed difference (i.e., a higher score in Condition 1 than Condition 2) does not equal
.5. This is a nondirectional alternative hypothesis, and it is evaluated with a two-tailed test.
In order to be supported, the observed proportion of positive signed differences in the sample
data (which will be represented with the notation p+) can be either significantly larger than the
hypothesized population proportion π+ = .5 or significantly smaller than π+ = .5.)


or


H₁: π+ > .5


(In the underlying population the sample represents, the proportion of subjects who obtain a
positive signed difference (i.e., a higher score in Condition 1 than Condition 2) is greater than
.5. This is a directional alternative hypothesis, and it is evaluated with a one-tailed test. In
order to be supported, the observed proportion of positive signed differences in the sample data
must be significantly larger than the hypothesized population proportion π+ = .5.)


or


H₁: π+ < .5


(In the underlying population the sample represents, the proportion of subjects who obtain a
positive signed difference (i.e., a higher score in Condition 1 than Condition 2) is less than .5.
This is a directional alternative hypothesis, and it is evaluated with a one-tailed test. In order
to be supported, the observed proportion of positive signed differences in the sample data must
be significantly smaller than the hypothesized population proportion π+ = .5.)


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected.²


IV. Test Computations 


The data for Example 19.1 are summarized in Table 19.1. Note that there are 10 subjects and 
that each subject has two scores. 


Table 19.1 Data for Example 19.1


Subject     X₁     X₂     D = X₁ - X₂     Signed difference

   1         9      8          1                  +
   2         2      2          0                  0
   3         1      3         -2                  -
   4         4      2          2                  +
   5         6      3          3                  +
   6         4      0          4                  +
   7         7      4          3                  +
   8         8      5          3                  +
   9         5      4          1                  +
  10         1      0          1                  +

                                              ΣD+ = 8
                                              ΣD- = 1


The following information can be derived from Table 19.1: a) Eight subjects (Subjects 1,
4, 5, 6, 7, 8, 9, 10) yield a difference score with a positive sign, i.e., a positive signed
difference; b) One subject (Subject 3) yields a difference score with a negative sign, i.e., a
negative signed difference; and c) One subject (Subject 2) obtains the identical score in both
conditions, and as a result of this yields a difference score of zero.

As is the case with the Wilcoxon matched-pairs signed-ranks test, in employing the 
binomial sign test for two dependent samples, any subject who obtains a zero difference score 
is eliminated from the data analysis. Since Subject 2 falls in this category, the size of the sample 
is reduced to n = 9, which is the same number of signed ranks employed when the Wilcoxon 
matched-pairs signed-ranks test is employed to evaluate the same set of data. 

The sampling distribution of the signed differences represents a binomially distributed variable
with an expected probability of .5 for each of the two mutually exclusive categories (i.e.,
positive signed difference versus negative signed difference). The logic underlying the binomial
sign test for two dependent samples is that if the two experimental conditions represent equivalent
populations, the signed differences should be randomly distributed. Thus, assuming that
subjects who obtain a difference score of zero are eliminated from the analysis, if the remaining 
signed differences are, in fact, randomly distributed, one-half of the subjects should obtain a 
positive signed difference and one-half of the subjects should obtain a negative signed difference. 
In Example 19.1 the observed proportion of positive signed differences is p+ = 8/9 = .89 and 
the observed proportion of negative signed differences is p- = 1/9 = .11. 
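
The counts and proportions above can be reproduced with a short script. This is a minimal sketch in Python (not part of the Handbook); the helper name `count_signs` is hypothetical, and the data are the ten score pairs listed in Example 19.1.

```python
# GSR scores from Example 19.1 as (Condition 1, Condition 2) pairs.
scores = [(9, 8), (2, 2), (1, 3), (4, 2), (6, 3),
          (4, 0), (7, 4), (8, 5), (5, 4), (1, 0)]


def count_signs(pairs):
    """Return the number of positive, negative, and zero differences X1 - X2."""
    diffs = [x1 - x2 for x1, x2 in pairs]
    positive = sum(d > 0 for d in diffs)
    negative = sum(d < 0 for d in diffs)
    zero = sum(d == 0 for d in diffs)
    return positive, negative, zero


n_pos, n_neg, n_zero = count_signs(scores)
n = n_pos + n_neg          # zero difference scores are dropped from the analysis
p_pos = n_pos / n          # observed proportion of positive signed differences
p_neg = n_neg / n          # observed proportion of negative signed differences

print(n_pos, n_neg, n_zero, n, round(p_pos, 2), round(p_neg, 2))
```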

Equation 19.1 (which is identical to Equation 9.5, except for the fact that π+ and π- are used
in place of π₁ and π₂) is employed to determine the probability of obtaining x = 8 or
more positive signed differences in a set of n = 9 scores.


$$P(\ge x) = \sum_{r=x}^{n} \binom{n}{r} (\pi{+})^{r} (\pi{-})^{n-r} \qquad \text{(Equation 19.1)}$$


Where: π+ and π-, respectively, represent the hypothesized values for the proportion of
       positive and negative signed differences
       n represents the number of signed differences
       x represents the number of positive signed differences


In employing Equation 19.1 with Example 19.1, the following values are employed: 
a) π+ = .5 and π- = .5, since if the null hypothesis is true, the proportion of positive and negative
signed differences should be equal. Note that the sum of π+ and π- must always equal 1;
b) n = 9, since there are 9 signed differences; and c) x = 8, since 8 subjects obtain a positive 
signed difference. 

The summation notation (the sum from r = x to n) in Equation 19.1 indicates that the probability
of obtaining a value of x equal to the observed number of positive signed differences must be
computed, as well as the probability for all values of x greater than the observed number of
positive signed differences up through and including the value of n. Thus, in the case of Example
19.1, the binomial probability must be computed for the values x = 8 and x = 9. Equation 19.1 is
employed below to compute
the latter probability. The obtained value .0195 represents the likelihood of obtaining 8 or more 
positive signed differences in a set of n = 9 signed differences. 


$$P(x \ge 8) = \binom{9}{8}(.5)^{8}(.5)^{1} + \binom{9}{9}(.5)^{9}(.5)^{0} = .0195$$


An even more efficient way of obtaining the probability P(8 or 9/9) = .0195 is through use 
of Table A7 (Table of the Binomial Distribution, Cumulative Probabilities) in the Appendix. 
In employing Table A7 we find the section for n = 9, and locate the cell that is the intersection 
of the row x = 8 and the column π = .5. The entry .0195 in that cell represents the probability of
obtaining 8 or more (i.e., 8 and 9) positive signed differences, if there are a total of 9 signed
differences.³
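
When a table of cumulative binomial probabilities is not at hand, the sum in Equation 19.1 can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `upper_tail` is an assumption introduced for illustration.

```python
from math import comb


def upper_tail(x, n, pi_pos=0.5):
    """P(x or more positive signed differences among n), per Equation 19.1."""
    pi_neg = 1.0 - pi_pos
    return sum(comb(n, r) * pi_pos**r * pi_neg**(n - r)
               for r in range(x, n + 1))


p_tail = upper_tail(8, 9)   # 8 or 9 positive signed differences out of 9
print(round(p_tail, 4))     # 0.0195
```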

Equation 19.2 (which is identical to Equation 9.3 employed for the binomial sign test for
a single sample, except for the fact that π+ and π- are employed in place of π₁ and π₂) can be
employed to compute each of the individual probabilities that are summed in Equation 19.1.


$$P(x) = \binom{n}{x} (\pi{+})^{x} (\pi{-})^{n-x} \qquad \text{(Equation 19.2)}$$

Since the computation of binomial probabilities can be quite tedious, in lieu of employing
Equation 19.2, Table A6 (Table of the Binomial Distribution, Individual Probabilities) in the
Appendix can be used to determine the appropriate probabilities. In employing Table A6 we
find the section for n = 9, and locate the cell that is the intersection of the row x = 8 and the
column π = .5. The entry .0176 in that cell represents the probability of obtaining exactly 8
positive signed differences, if there are a total of 9 signed differences. Additionally, we locate
the cell that is the intersection of the row x = 9 and the column π = .5. The entry .0020 in that
cell represents the probability of obtaining exactly 9 positive signed differences, if there are a
total of 9 signed differences. Summing the latter two values yields the value P(8 or 9/9) = .0196,
which is the likelihood of observing 8 or 9 positive signed differences in a set of n = 9 signed
differences.⁴ For a comprehensive discussion on the computation of binomial probabilities and the
use of Tables A6 and A7, the reader should review Section IV of the binomial sign test for a 
single sample. 
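
The small difference between .0195 (cumulative table) and .0196 (sum of two individual table entries) can be traced to rounding. The sketch below, which assumes nothing beyond Equation 19.2 and uses the hypothetical helper name `binom_pmf`, computes the two individual probabilities and compares rounding before and after summation.

```python
from math import comb


def binom_pmf(x, n, pi_pos=0.5):
    """Probability of exactly x positive signed differences (Equation 19.2)."""
    return comb(n, x) * pi_pos**x * (1.0 - pi_pos)**(n - x)


p8 = binom_pmf(8, 9)   # exactly 8 of 9
p9 = binom_pmf(9, 9)   # exactly 9 of 9

sum_of_rounded = round(p8, 4) + round(p9, 4)   # round each table entry, then add
rounded_sum = round(p8 + p9, 4)                # add exact values, then round

print(f"{p8:.4f} {p9:.4f} {sum_of_rounded:.4f} {rounded_sum:.4f}")
```

Rounding each entry to four places before adding gives .0196, whereas rounding the exact sum gives .0195.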


V. Interpretation of the Test Results 


The following guidelines are employed in evaluating the null hypothesis. 

a) If a nondirectional alternative hypothesis is employed, the null hypothesis can be rejected
if the probability of obtaining a value equal to or more extreme than x is equal to or less than α/2
(where α represents the prespecified value of alpha). The reader should take note of the fact that if
the proportion of positive signed differences in the data (i.e., p+) is greater than π+ = .5, a value
that is more extreme than x will be any value that falls above the observed value of x, whereas
if the proportion of positive signed differences in the data is less than π+ = .5, a value that is more
extreme than x will be any value that falls below the observed value of x.

b) If a directional alternative hypothesis is employed which predicts that the underlying
population proportion is above the hypothesized value π+ = .5, in order to reject the null
hypothesis both of the following conditions must be met: 1) The proportion of positive signed
differences must be greater than the value π+ = .5 stipulated in the null hypothesis; and 2) The
probability of obtaining a value equal to or greater than x is equal to or less than the prespecified
value of α.

c) If a directional alternative hypothesis is employed which predicts that the underlying
population proportion is below the hypothesized value π+ = .5, in order to reject the null
hypothesis both of the following conditions must be met: 1) The proportion of positive signed
differences must be less than the value π+ = .5 stipulated in the null hypothesis; and 2) The
probability of obtaining a value equal to or less than x is equal to or less than the prespecified
value of α.

Applying the above guidelines to Example 19.1, we can conclude the following. 

The nondirectional alternative hypothesis H₁: π+ ≠ .5 is supported at the α = .05 level,
since the obtained probability .0195 is less than α/2 = .05/2 = .025. The nondirectional
alternative hypothesis H₁: π+ ≠ .5 is not supported at the α = .01 level, since the probability
.0195 is greater than α/2 = .01/2 = .005.⁵

The directional alternative hypothesis H₁: π+ > .5 is supported at the α = .05 level. This is
the case because: a) The data are consistent with the directional alternative hypothesis H₁: π+
> .5, since p+ = .89 is greater than the value π+ = .5 stated in the null hypothesis; and b) The
obtained probability .0195 is less than α = .05. The directional alternative hypothesis H₁: π+
> .5 is not supported at the α = .01 level, since the probability .0195 is greater than α = .01.

The directional alternative hypothesis H₁: π+ < .5 is not supported, since the data are
not consistent with it. Specifically, p+ = .89 does not meet the requirement of being less than the
value π+ = .5 stated in the null hypothesis.

A summary of the analysis of Example 19.1 with the binomial sign test for two dependent
samples follows: It can be concluded that subjects exhibited higher GSR (emotionality) scores 
with respect to the sexually explicit words than the neutral words. 
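
The decision guidelines a) through c) can be expressed as a small routine. This is a sketch under the assumptions stated in the text (hypothesized π+ = .5, zero difference scores already removed); the function names `tail_prob` and `sign_test_decision` and the labels `"two-sided"`, `"greater"`, and `"less"` are hypothetical and not taken from the Handbook.

```python
from math import comb


def tail_prob(x, n, upper, pi_pos=0.5):
    """Binomial probability of x or more (upper=True) or x or fewer positives."""
    values = range(x, n + 1) if upper else range(0, x + 1)
    return sum(comb(n, r) * pi_pos**r * (1.0 - pi_pos)**(n - r) for r in values)


def sign_test_decision(x, n, alpha, alternative):
    """Return True if the null hypothesis pi+ = .5 is rejected."""
    p_pos = x / n
    if alternative == "two-sided":
        # Guideline a): tail in the direction of the data, compared with alpha/2.
        return tail_prob(x, n, upper=p_pos > 0.5) <= alpha / 2
    if alternative == "greater":
        # Guideline b): data must be in the predicted direction, tail <= alpha.
        return p_pos > 0.5 and tail_prob(x, n, upper=True) <= alpha
    if alternative == "less":
        # Guideline c): mirror image of guideline b).
        return p_pos < 0.5 and tail_prob(x, n, upper=False) <= alpha
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")


for alt in ("two-sided", "greater", "less"):
    print(alt, sign_test_decision(8, 9, 0.05, alt), sign_test_decision(8, 9, 0.01, alt))
```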

When the binomial sign test for two dependent samples and the Wilcoxon matched-pairs
signed-ranks test are applied to the same set of data, the two tests yield identical
conclusions. Specifically, both tests support the nondirectional alternative hypothesis and the 
directional alternative hypothesis that is consistent with the data at the .05 level. Although it is 
not immediately apparent from this example, as a general rule, when applied to the same data the 
binomial sign test for two dependent samples tends to be less powerful than the Wilcoxon 
matched-pairs signed-ranks test. This is the case, since by not considering the magnitude of 
the difference scores, the binomial sign test for two dependent samples employs less 
information than the Wilcoxon matched-pairs signed-ranks test. As is the case with the 
Wilcoxon matched-pairs signed-ranks test, the binomial sign test for two dependent samples 
utilizes less information than the t test for two dependent samples, and thus in most instances,
it will provide a less powerful test of an alternative hypothesis than the latter test. In point of
fact, in the case of Example 19.1, if the data are evaluated with the t test for two dependent
samples, the directional alternative hypothesis that is consistent with the data is supported at both
the .05 and .01 levels. It should be noted, however, that if the normality assumption of the t test
for two dependent samples is saliently violated, in some instances the binomial sign test for 
two dependent samples may provide a more powerful test of an analogous alternative 
hypothesis. 


VI. Additional Analytical Procedures for the Binomial Sign Test 
for Two Dependent Samples and/or Related Tests 


1. The normal approximation of the binomial sign test for two dependent samples with and 
without a correction for continuity With large sample sizes the normal approximation for the 
binomial distribution (which is discussed in Section VI of the binomial sign test for a single 
sample) can provide a large sample approximation for the binomial sign test for two dependent 
samples. As a general rule, most sources recommend employing the normal approximation for 
sample sizes larger than those documented in the table of the binomial distribution contained in 
the source. Equation 19.3 (which is equivalent to Equation 9.7) is the normal approximation 
equation for the binomial sign test for two dependent samples. When a correction for continuity
is used, Equation 19.4 (which is equivalent to Equation 9.9) is employed.⁶


$$z = \frac{x - n\pi{+}}{\sqrt{(n)(\pi{+})(\pi{-})}} \qquad \text{(Equation 19.3)}$$


$$z = \frac{|x - n\pi{+}| - .5}{\sqrt{(n)(\pi{+})(\pi{-})}} \qquad \text{(Equation 19.4)}$$


Although Example 19.1 involves only nine signed differences (a value most sources would view
as too small to use with the normal approximation), it will be employed to illustrate Equations 
19.3 and 19.4. The reader will see that in spite of employing the normal approximation with a 
small sample size, it yields essentially the same results as those obtained with the exact binomial 
probabilities. 

Employing Equation 19.3, the value z = 2.33 is computed.


$$z = \frac{8 - (9)(.5)}{\sqrt{(9)(.5)(.5)}} = 2.33$$


Employing Equation 19.4, the value z = 2.00 is computed.


$$z = \frac{|8 - (9)(.5)| - .5}{\sqrt{(9)(.5)(.5)}} = 2.00$$
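
Both z values can be checked numerically. The following sketch implements Equations 19.3 and 19.4 in a single function; the name `z_sign_test` and the `continuity` flag are assumptions made for illustration.

```python
from math import sqrt


def z_sign_test(x, n, pi_pos=0.5, continuity=False):
    """Normal approximation for the sign test (Equations 19.3 and 19.4)."""
    expected = n * pi_pos
    std_dev = sqrt(n * pi_pos * (1.0 - pi_pos))
    if continuity:
        # Equation 19.4: reduce the absolute deviation by .5 before dividing.
        return (abs(x - expected) - 0.5) / std_dev
    # Equation 19.3: no correction for continuity.
    return (x - expected) / std_dev


z_uncorrected = z_sign_test(8, 9)
z_corrected = z_sign_test(8, 9, continuity=True)
print(round(z_uncorrected, 2), round(z_corrected, 2))   # 2.33 2.0
```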


The obtained z values are evaluated with Table A1 (Table of the Normal Distribution)
in the Appendix. In Table A1 the tabled critical two-tailed .05 and .01 values are z.05 = 1.96
and z.01 = 2.58, and the tabled critical one-tailed .05 and .01 values are z.05 = 1.65 and
z.01 = 2.33. The following guidelines are employed in evaluating the null hypothesis.

a) If a nondirectional alternative hypothesis is employed, the null hypothesis can be rejected
if the obtained absolute value of z is equal to or greater than the tabled critical two-tailed value 
at the prespecified level of significance. 

b) If a directional alternative hypothesis is employed, only the directional alternative 
hypothesis that is consistent with the data can be supported. With respect to the latter alternative 
hypothesis, the null hypothesis can be rejected if the obtained absolute value of z is equal to or 
greater than the tabled critical one-tailed value at the prespecified level of significance. 

Employing the above guidelines, we can conclude the following. 

Since the value z = 2.33 computed with Equation 19.3 is greater than the tabled critical
two-tailed value z.05 = 1.96 but less than the tabled critical two-tailed value z.01 = 2.58, the
nondirectional alternative hypothesis H₁: π+ ≠ .5 is supported, but only at the .05 level. Since
the value z = 2.33 is greater than the tabled critical one-tailed value z.05 = 1.65 and equal to
the tabled critical one-tailed value z.01 = 2.33, the directional alternative hypothesis
H₁: π+ > .5 is supported at both the .05 and .01 levels. Note that when the exact binomial
probabilities are employed, both the nondirectional alternative hypothesis H₁: π+ ≠ .5 and
the directional alternative hypothesis H₁: π+ > .5 are supported, but in both instances, only
at the .05 level.

When the correction for continuity is employed, the value z = 2.00 computed with Equation
19.4 is greater than the tabled critical two-tailed value z.05 = 1.96 but less than the tabled critical
two-tailed value z.01 = 2.58. Thus, the nondirectional alternative hypothesis H₁: π+ ≠ .5 is
only supported at the .05 level. Since the value z = 2.00 is greater than the tabled critical
one-tailed value z.05 = 1.65 but less than the tabled critical one-tailed value z.01 = 2.33, the
directional alternative hypothesis H₁: π+ > .5 is also supported, but only at the .05 level.
The results with the correction for continuity are identical to those obtained when the exact 
binomial probabilities are employed. Note that the continuity-corrected normal approximation 
provides a more conservative test of the null hypothesis than does the uncorrected normal 
approximation. 
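
The comparisons with the tabled critical values can be summarized in a few lines. In this sketch the dictionary `CRITICAL` holds the four tabled values quoted above, and the function name `rejects_h0` is hypothetical; for a one-tailed test the function assumes the caller has already confirmed that the data are in the predicted direction.

```python
# Tabled critical z values quoted in the text (Table A1).
CRITICAL = {
    ("two-tailed", 0.05): 1.96,
    ("two-tailed", 0.01): 2.58,
    ("one-tailed", 0.05): 1.65,
    ("one-tailed", 0.01): 2.33,
}


def rejects_h0(z, tails, alpha):
    """True if |z| equals or exceeds the tabled critical value."""
    return abs(z) >= CRITICAL[(tails, alpha)]


for label, z in (("uncorrected", 2.33), ("corrected", 2.00)):
    for tails in ("two-tailed", "one-tailed"):
        print(label, tails,
              rejects_h0(z, tails, 0.05), rejects_h0(z, tails, 0.01))
```

With z = 2.33 the one-tailed test is significant at both levels, whereas with z = 2.00 neither test is significant at the .01 level, which illustrates the more conservative nature of the continuity-corrected value.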

The chi-square goodness-of-fit test (Test 8) (either with or without the correction for 
continuity) can also be employed to provide a large sample approximation of the binomial sign 
test for two dependent samples. The chi-square goodness-of-fit test, which evaluates the 
relationship between the observed and expected frequencies in the two categories (i.e., positive 
signed difference versus negative signed difference), will yield a result that is equivalent to that 
obtained with the normal approximation. The computed chi-square value will equal the square 
of the z value derived with the normal approximation. 

Equation 8.2 is employed for the chi-square analysis, without using a correction for continuity.
Equation 8.6 is the continuity-corrected equation. Table 19.2 summarizes the analysis
of Example 19.1 with Equation 8.2. 





Table 19.2 Chi-Square Summary Table for Example 19.1


Cell                            Oᵢ       Eᵢ      (Oᵢ - Eᵢ)    (Oᵢ - Eᵢ)²    (Oᵢ - Eᵢ)²/Eᵢ

Positive signed differences      8       4.5        3.5         12.25           2.72
Negative signed differences      1       4.5       -3.5         12.25           2.72

                              ΣOᵢ = 9  ΣEᵢ = 9  Σ(Oᵢ - Eᵢ) = 0                χ² = 5.44


In Table 19.2, the expected frequency Eᵢ = 4.5 for each cell is computed by multiplying
the hypothesized population proportion for the cell (.5 for both cells) by n = 9. Since k = 2,
the degrees of freedom employed for the chi-square analysis are df = k - 1 = 1. The obtained
value χ² = 5.44 is evaluated with Table A4 (Table of the Chi-Square Distribution) in the
Appendix. For df = 1, the tabled critical .05 and .01 chi-square values are χ².05 = 3.84 (which
corresponds to the chi-square value at the 95th percentile) and χ².01 = 6.63 (which corresponds
to the chi-square value at the 99th percentile). Since the obtained value χ² = 5.44 is greater
than χ².05 = 3.84 but less than χ².01 = 6.63, the nondirectional alternative hypothesis
H₁: π+ ≠ .5 is supported, but only at the .05 level. Since χ² = 5.44 is greater than the tabled
critical one-tailed .05 value χ².05 = 2.71 (which corresponds to the chi-square value at the 90th
percentile) and the tabled critical one-tailed .01 value χ².01 = 5.43 (which corresponds to the
chi-square value at the 98th percentile), the directional alternative hypothesis H₁: π+ > .5 is
supported at both the .05 and .01 levels.⁷ The aforementioned conclusions are analogous to those
reached when Equation 19.3 is employed.

As noted previously, if the z value obtained with Equation 19.3 is squared, it will always
equal the chi-square value computed for the same data. Thus, in the current example where
z = 2.33 and χ² = 5.44, (2.33)² = 5.43 (the discrepancy of .01 is the result of rounding z).
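
The equality between the chi-square value and the square of z can be verified without rounding. The sketch below assumes the usual goodness-of-fit sum of (O - E)²/E described for Table 19.2; the function name `chi_square_gof` is hypothetical.

```python
from math import sqrt


def chi_square_gof(observed, expected):
    """Sum of (O - E)^2 / E over all cells, without a continuity correction."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [8, 1]                    # positive and negative signed differences
n = sum(observed)
expected = [0.5 * n, 0.5 * n]        # hypothesized proportion .5 in each cell

chi2 = chi_square_gof(observed, expected)
z = (observed[0] - n * 0.5) / sqrt(n * 0.5 * 0.5)   # Equation 19.3

print(round(chi2, 2), round(z ** 2, 2))   # 5.44 5.44
```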

Equation 8.6 (which, as noted previously, is the continuity-corrected equation for the
chi-square goodness-of-fit test) is employed below, and yields an equivalent result to that obtained
with Equation 19.4. In employing Equation 8.6, the value (|Oᵢ - Eᵢ| - .5) = 3 is employed for
each cell. Thus:


$$\chi^{2} = \sum_{i=1}^{k} \frac{(|O_i - E_i| - .5)^{2}}{E_i} = \frac{(3)^{2}}{4.5} + \frac{(3)^{2}}{4.5} = 4$$

Note that the obtained value χ² = 4 is equal to the square of the value z = 2.00 obtained
with Equation 19.4. Since χ² = 4 is greater than χ².05 = 3.84 but less than χ².01 = 6.63, the
nondirectional alternative hypothesis H₁: π+ ≠ .5 is supported, but only at the .05 level.
Since χ² = 4 is greater than the tabled critical one-tailed .05 value χ².05 = 2.71 but less than
the tabled critical one-tailed .01 value χ².01 = 5.43, the directional alternative hypothesis
H₁: π+ > .5 is also supported at only the .05 level. The aforementioned conclusions are
identical to those reached when Equation 19.4 is employed.
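
A corresponding check for the continuity-corrected analysis is sketched below. The function name `chi_square_gof_corrected` is an assumption; the computation follows the form of Equation 8.6 as it is applied above.

```python
def chi_square_gof_corrected(observed, expected):
    """Continuity-corrected goodness-of-fit sum: (|O - E| - .5)^2 / E per cell."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))


observed = [8, 1]          # positive and negative signed differences
expected = [4.5, 4.5]      # n = 9 multiplied by the hypothesized proportion .5

chi2_corrected = chi_square_gof_corrected(observed, expected)
print(chi2_corrected)      # 4.0, the square of z = 2.00 from Equation 19.4
```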


2. Computation of a confidence interval for the binomial sign test for two dependent
samples Equation 19.5 (which is equivalent to Equation 8.5, except for the fact that p+, p-, and
π+ are employed in place of p₁, p₂, and π₁) can be used to compute a confidence interval for
the binomial sign test for two dependent samples. Since Equation 19.5 is based on the normal
approximation, it should be employed with large sample sizes. It will, however, be used here with
the data for Example 19.1. The 95% confidence interval computed with Equation 19.5 estimates
the proportion of positive signed differences in the underlying population.


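$$p{+} - z_{\alpha/2}\sqrt{\frac{(p{+})(p{-})}{n}} \;\le\; \pi{+} \;\le\; p{+} + z_{\alpha/2}\sqrt{\frac{(p{+})(p{-})}{n}} \qquad \text{(Equation 19.5)}$$


Employing the values computed for Example 19.1, Equation 19.5 is employed to compute
the 95% confidence interval.


$$.89 - (1.96)\sqrt{\frac{(.89)(.11)}{9}} \;\le\; \pi{+} \;\le\; .89 + (1.96)\sqrt{\frac{(.89)(.11)}{9}}$$

$$\pi{+} = .89 \pm .204$$

$$.686 \le \pi{+} \le 1.094$$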

Thus, the researcher can be 95% confident (or the probability is .95) that the true proportion 
of positive signed differences in the underlying population is a value between .686 and 1.094. 
Obviously, since a proportion cannot be greater than 1, the range of values identified by the 
confidence interval will fall between .686 and 1. 
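
The interval can be reproduced as follows. This is a sketch of the normal-approximation interval in Equation 19.5; the function name `proportion_ci` and the optional clipping to the range 0 to 1 are assumptions added for illustration.

```python
from math import sqrt


def proportion_ci(p_pos, n, z=1.96, clip=False):
    """Normal-approximation confidence interval for pi+ (Equation 19.5)."""
    margin = z * sqrt(p_pos * (1.0 - p_pos) / n)
    lower, upper = p_pos - margin, p_pos + margin
    if clip:
        # A proportion cannot fall outside the range 0 to 1.
        lower, upper = max(lower, 0.0), min(upper, 1.0)
    return lower, upper


lower, upper = proportion_ci(0.89, 9)
print(round(lower, 3), round(upper, 3))          # 0.686 1.094
print(proportion_ci(0.89, 9, clip=True)[1])      # 1.0
```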

It should be noted that the above method for computing a confidence interval ignores the 
presence of any zero difference scores. Consequently, the range of values computed for the con- 
fidence interval assumes there are no zero difference scores in the underlying population. If, in 
fact, there are zero difference scores in the population, the above computed confidence interval 
only identifies proportions that are relevant to the total number of cases in the population that are 
not zero difference scores. In point of fact, when one or more zero difference scores are present 
in the sample data, a researcher may want to assume that zero difference scores are present in the 
underlying population. If the researcher makes such an assumption and employs the sample data 
to estimate the proportion of zero difference scores in the population, the value employed for 
p+ in Equation 19.5 will represent the number of positive signed differences in the sample 
divided by the total number of scores in the sample, including any zero difference scores. Thus, 
in the case of Example 19.1, the value p+ = 8/10 = .8 is computed by dividing 8 (the number
of positive signed differences) by n = 10. The value p- in Equation 19.5 will no longer 
represent just the negative signed differences, but will represent all signed differences that are 
not positive (i.e., both negative signed differences and zero difference scores). Thus, in the case 
of Example 19.1, p- = 2/10 = .2, since there is one negative signed difference and one zero
difference score. If the values n = 10, p+ = .8, and p- = .2 are employed in Equation 19.5, the
confidence interval .552 ≤ π+ ≤ 1.048 is computed.


$$.8 - (1.96)\sqrt{\frac{(.8)(.2)}{10}} \;\le\; \pi{+} \;\le\; .8 + (1.96)\sqrt{\frac{(.8)(.2)}{10}}$$

$$\pi{+} = .8 \pm .248$$

$$.552 \le \pi{+} \le 1.048$$


Thus, the researcher can be 95% confident (or the probability is .95) that the true
proportion of positive signed differences in the underlying population is a value between .552 and
1.048. Since, as noted earlier, a proportion cannot be greater than 1, the range identified by
the confidence interval will fall between .552 and 1. If the researcher wants to employ the
same method for computing a confidence interval for the proportion of minus signed differences
in the population (i.e., π-), the product z(α/2) × √([(p+)(p-)]/n) is added to and subtracted
from p- = .1. The values p- = 1/10 = .1 and p+ = 9/10 = .9 are employed in the confidence
interval equation, since there is one negative signed difference and nine signed differences that
are not negative (i.e., eight positive signed differences and one zero difference score).
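
The two intervals that retain the zero difference score in the denominator can be sketched in the same way. The function name `proportion_ci` is hypothetical; the numerical bounds for π- are not reported in the text and are simply whatever the formula yields for p- = .1 and n = 10.

```python
from math import sqrt


def proportion_ci(p, n, z=1.96):
    """Normal-approximation interval p +/- z * sqrt(p(1 - p)/n)."""
    margin = z * sqrt(p * (1.0 - p) / n)
    return p - margin, p + margin


n_total = 10                      # all subjects, including the zero difference
pos_ci = proportion_ci(8 / n_total, n_total)   # p+ = .8, p- = .2
neg_ci = proportion_ci(1 / n_total, n_total)   # p- = .1, p+ = .9

print([round(v, 3) for v in pos_ci])   # [0.552, 1.048]
print([round(v, 3) for v in neg_ci])   # lower bound is below 0, so it is read as 0
```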


3. Sources for computing the power of the binomial sign test for two dependent samples, 
and comments on asymptotic relative efficiency of the test Cohen (1977, 1988) has
developed a statistic called the g index that can be employed to compute the power of the binomial
sign test for a single sample when H₀: π₁ = .5 is evaluated. The latter effect size index can
be generalized to compute the power of the binomial sign test for two dependent samples. The 
g index represents the distance in units of proportion from the value .50. The equation Cohen 
(1977, 1988) employs for the g index is g = P — .50, where P represents the hypothesized value 
of the population proportion stated in the alternative hypothesis — in this instance it is assumed 
that the researcher has stated a specific value in the alternative hypothesis as an alternative to the 
value that is stipulated in the null hypothesis. 

Cohen (1977; 1988, Ch. 5) has derived tables that allow a researcher, through use of the g 
index, to determine the appropriate sample size to employ if one wants to test a hypothesis about 
the distance of a proportion from the value .5 at a specified level of power. Cohen (1977; 1988, 
pp. 147-150) has proposed the following (admittedly arbitrary) g values as criteria for identifying 
the magnitude of an effect size: a) A small effect size is one that is greater than .05 but not more 
than .15; b) A medium effect size is one that is greater than .15 but not more than .25; and c) A 
large effect size is greater than .25. 
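
Cohen's criteria can be encoded as a simple lookup. In the sketch below the function names `g_index` and `effect_size_label` are hypothetical, as is the label returned for values of .05 or less (the text does not name that range). The absolute value of g is used on the assumption that only the distance from .50 matters, and g is rounded to guard against floating-point error at the boundaries.

```python
def g_index(p_alternative):
    """Cohen's g: distance of the hypothesized proportion from .50."""
    return p_alternative - 0.50


def effect_size_label(g):
    """Classify g with the cutoffs .05, .15, and .25 quoted in the text."""
    size = round(abs(g), 10)     # rounding avoids spurious boundary crossings
    if size > 0.25:
        return "large"
    if size > 0.15:
        return "medium"
    if size > 0.05:
        return "small"
    return "below small"         # assumed label; not defined in the text


for p in (0.60, 0.65, 0.70, 0.80):
    print(p, effect_size_label(g_index(p)))
```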

Marascuilo and McSweeney (1977) note that if the underlying population distribution is 
normal, the asymptotic relative efficiency (which is discussed in Section VII of the Wilcoxon 
signed-ranks test (Test 6)) of the binomial sign test is .637, in contrast to an asymptotic relative
efficiency of .955 for the Wilcoxon matched-pairs signed-ranks test (with both asymptotic
relative efficiencies being in reference to the t test for two dependent samples). When the
underlying population distribution is not normal, in most cases, the asymptotic relative efficiency 
of the Wilcoxon matched-pairs signed-ranks test will be higher than the analogous value for 
the binomial sign test for two dependent samples. 
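
The two efficiency values quoted for the normal case agree with the closed-form expressions 2/π and 3/π that are commonly given for the sign test and the Wilcoxon test, respectively. These expressions are not stated in the passage above and are introduced here only as a numerical check.

```python
from math import pi

# Assumed closed forms for the asymptotic relative efficiencies under normality.
are_sign_test = 2 / pi
are_wilcoxon = 3 / pi

print(round(are_sign_test, 3), round(are_wilcoxon, 3))   # 0.637 0.955
```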


VII. Additional Discussion of the Binomial Sign Test for Two 
Dependent Samples 


1. The problem of an excessive number of zero difference scores When there is an excessive 
number of subjects who have a zero difference score in a set of data, a substantial amount of 
information is sacrificed if the binomial sign test for two dependent samples is employed to 
evaluate the data. Under such conditions, it is advisable to evaluate the data with the t test for
two dependent samples (assuming the interval/ratio scores of subjects are available). If one or 
more of the assumptions of the latter test are saliently violated, the alpha level employed for the 
t test should be adjusted. 


2. Equivalency of the Friedman two-way analysis of variance by ranks and the binomial sign
test for two dependent samples when k = 2 In Section VII of the Friedman two-way analysis
of variance by ranks (Test 25), it is demonstrated that when there are two dependent samples
and there are no zero difference scores, the latter test (which can be employed for two or more
dependent samples) is equivalent to the chi-square approximation of the binomial sign test for
two dependent samples (i.e., it will yield the same chi-square value computed with Equation
8.2). When employing the Friedman two-way analysis of variance by ranks with two dependent
samples, the two scores of each subject (or pair of matched subjects) are rank-ordered.
The data for Example 19.1 can be expressed in a rank-order format, if for each subject a rank of 
1 is assigned to the lower of the two scores and a rank of 2 is assigned to the higher score (or vice 
versa). If a researcher only has such rank-order information, it is still possible to assign a signed 
difference to each subject, since the ordering of a subject’s two ranks provides sufficient
information to determine whether the difference between the two scores of a subject would yield a
positive or negative value if the interval/ratio scores of the subject were available. Consequently,
under such conditions one can still conduct the binomial sign test for two dependent samples.
On the other hand, if a researcher only has the sort of rank-order information noted above, one
will not be able to evaluate the data with either the t test for two dependent samples or the
Wilcoxon matched-pairs signed-ranks test, since the latter two tests require the interval/ratio
scores of subjects. 
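
The equivalence described above can be checked numerically. The following is a minimal sketch, assuming the standard Friedman statistic, 12/[nk(k + 1)] times the sum of the squared rank sums, minus 3n(k + 1), and the usual chi-square goodness-of-fit statistic with an expected frequency of n/2 for each sign. The function names and the rank data (8 subjects ranked one way, 1 subject ranked the other way) are hypothetical and chosen only for illustration.

```python
def friedman_chi_square(ranks):
    """Friedman statistic for a list of per-subject rank tuples (one rank per condition)."""
    n = len(ranks)
    k = len(ranks[0])
    rank_sums = [sum(subject[j] for subject in ranks) for j in range(k)]
    return 12.0 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)


def sign_test_chi_square(n_pos, n_neg):
    """Chi-square goodness-of-fit statistic with an expected frequency of n/2 per sign."""
    expected = (n_pos + n_neg) / 2.0
    return ((n_pos - expected) ** 2 + (n_neg - expected) ** 2) / expected


# Hypothetical ranks: 8 subjects score higher in Condition 1, 1 scores higher in Condition 2.
ranks = [(2, 1)] * 8 + [(1, 2)]
n_pos = sum(1 for r1, r2 in ranks if r1 > r2)  # positive signed differences
n_neg = sum(1 for r1, r2 in ranks if r1 < r2)  # negative signed differences

friedman_value = friedman_chi_square(ranks)
sign_value = sign_test_chi_square(n_pos, n_neg)
print(round(friedman_value, 4), round(sign_value, 4))
```

With k = 2 and no zero difference scores, both expressions reduce algebraically to the squared difference between the two sign counts divided by n, which is why the two printed values agree.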


VIII. Additional Examples Illustrating the Use of the Binomial 
Sign Test for Two Dependent Samples 


The binomial sign test for two dependent samples can be employed with any of the additional 
examples noted for the t test for two dependent samples and the Wilcoxon matched-pairs 
signed-ranks test. In each of the examples, a signed difference must be computed for each sub- 
ject (or pair of matched subjects). The signed differences are then evaluated employing the 
protocol for the binomial sign test for two dependent samples. 


Endnotes 


1. Some sources note that one assumption of the binomial sign test for two dependent 
samples is that the variable being measured is based on a continuous distribution. In 
practice, however, this assumption is often not adhered to. 


2. Another way of stating the null hypothesis is that in the underlying population the sample 
represents, the proportion of subjects who obtain a positive signed difference is equal to the 
proportion of subjects who obtain a negative signed difference. The null and alternative 
hypotheses can also be stated with respect to the proportion of people in the population who 
obtain a higher score in Condition 2 than Condition 1, thus yielding a negative difference 
score. The notation $\pi_-$ represents the proportion of the population who yield a difference 
with a negative sign (referred to as a negative signed difference). Thus, $H_0$: $\pi_- = .5$ can 
be employed as the null hypothesis, and the following nondirectional and directional 
alternative hypotheses can be employed: $H_1$: $\pi_- \neq .5$; $H_1$: $\pi_- > .5$; $H_1$: $\pi_- < .5$. 


3. It is also the likelihood of obtaining 8 or 9 negative signed differences in a set of 9 signed 
differences. 


4. Due to rounding-off protocol, the value computed with Equation 19.1 will be either .0195 or 
.0196, depending upon whether one employs Table A6 or Table A7. 


5. An equivalent way of determining whether or not the result is significant is by doubling the 
value of the cumulative probability obtained from Table A7. In order to reject the null 
hypothesis, the resulting value must not be greater than the value of $\alpha$. Since 2 × .0195 = 
.039 is less than $\alpha$ = .05, we confirm that the nondirectional alternative hypothesis is 
supported when $\alpha$ = .05. Since .039 is greater than $\alpha$ = .01, it is not supported at the .01 level. 


6. Equations 9.6 and 9.8 are respectively alternate but equivalent forms of Equations 19.3 and 
19.4. Note that in Equations 9.6-9.9, $\pi_1$ and $\pi_2$ are employed in place of $\pi_+$ and $\pi_-$ to 
represent the two population proportions. 


7. A full discussion of the protocol for determining one-tailed chi-square values can be found 
in Section VII of the chi-square goodness-of-fit test. 


8. The minimal discrepancy is the result of rounding off error. 


Test 20 
The McNemar Test 


(Nonparametric Test Employed with Categorical/Nominal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Do two dependent samples represent two different popula- 
tions? 


Relevant background information on test It is recommended that before reading the material 
on the McNemar test, the reader review the general information on a dependent samples design 
contained in Sections I and VII of the t test for two dependent samples (Test 17). The 
McNemar test (McNemar, 1947) is a nonparametric procedure for categorical data employed 
in a hypothesis testing situation involving a design with two dependent samples. In actuality, the 
McNemar test is a special case of the Cochran Q test (Test 26), which can be employed to 
evaluate a k dependent samples design involving categorical data, where k ≥ 2. The McNemar 
test is employed to evaluate an experiment in which a sample of n subjects (or n pairs of matched 
subjects) is evaluated on a dichotomous dependent variable (i.e., scores on the dependent variable 
must fall within one of two mutually exclusive categories). The McNemar test assumes that 
each of the n subjects (or n pairs of matched subjects) contributes two scores on the dependent 
variable. The test is most commonly employed to analyze data derived from the two types of ex- 
perimental designs described below. 

a) The McNemar test can be employed to evaluate categorical data obtained in a true 
experiment (i.e., an experiment involving a manipulated independent variable).¹ In such an 
experiment, the two scores of each subject (or pair of matched subjects) represent a subject's 
responses under the two levels of the independent variable (i.e., the two experimental conditions). 
A significant result allows the researcher to conclude there is a high likelihood the two ex- 
perimental conditions represent two different populations. As is the case with the t test for two 
dependent samples, the Wilcoxon matched-pairs signed-ranks test (Test 18), and the bi- 
nomial sign test for two dependent samples (Test 19), when the McNemar test is employed 
to evaluate the data for a true experiment, in order for the test to generate valid results, the 
following guidelines should be adhered to: 1) In order to control for order effects, the presen- 
tation of the two experimental conditions should be random or, if appropriate, be counter- 
balanced; and 2) If matched samples are employed, within each pair of matched subjects each 
of the subjects should be randomly assigned to one of the two experimental conditions. 

b) The McNemar test can be employed to evaluate a before-after design (which is 
described in Section VII of the t test for two dependent samples). In applying the McNemar 
test to a before-after design, n subjects are administered a pretest on a dichotomous dependent 
variable. Following the pretest, all of the subjects are exposed to an experimental treatment, after 
which they are administered a posttest on the same dichotomous dependent variable. The 
hypothesis evaluated with a before-after design is whether or not there is a significant difference 
between the pretest and posttest scores of subjects on the dependent variable. The reader is 
advised to review the discussion of the before-after design in Section VII of the t test for two 
dependent samples, since the limitations noted for the design also apply when it is evaluated 
with the McNemar test. 

The 2 x 2 table depicted in Table 20.1 summarizes the McNemar test model. The entries 
for Cells a, b, c, and d in Table 20.1 represent the number of subjects/observations in each of 
four possible categories that can be employed to summarize the two responses of a subject (or 
matched pair of subjects) on a dichotomous dependent variable. Each of the four response cate- 
gory combinations represents the number of subjects/observations whose response in Condition 
1/Pretest falls in the response category for the row in which the cell falls, and whose response 
in Condition 2/Posttest falls in the response category for the column in which the cell falls. Thus, 
the entry in Cell a represents the number of subjects who respond in Response category 1 in both 
Condition 1/Pretest and Condition 2/Posttest. The entry in Cell b represents the number of 
subjects who respond in Response category 1 in Condition 1/Pretest and in Response category 
2 in Condition 2/Posttest. The entry in Cell c represents the number of subjects who respond in 
Response category 2 in Condition 1/Pretest and in Response category 1 in Condition 2/Posttest. 
The entry in Cell d represents the number of subjects who respond in Response category 2 in 
both Condition 1/Pretest and Condition 2/Posttest. 


Table 20.1 Model for the McNemar Test 


                                               Condition 2/Posttest 
                                            Response       Response 
                                            category 1     category 2     Row sums 
Condition 1/Pretest   Response category 1       a              b          a + b = n₁ 
                      Response category 2       c              d          c + d = n₂ 
                      Column sums             a + c          b + d            n 
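
The cell definitions in Table 20.1 can be illustrated with a short tallying routine. This is a minimal sketch rather than a procedure given in the text: the function name `mcnemar_cells`, the response labels, and the five paired responses are all hypothetical.

```python
from collections import Counter


def mcnemar_cells(pairs, category_1):
    """Tally (Condition 1/Pretest, Condition 2/Posttest) response pairs into Cells a-d."""
    counts = Counter(
        (first == category_1, second == category_1) for first, second in pairs
    )
    return {
        "a": counts[(True, True)],    # Response category 1 in both conditions
        "b": counts[(True, False)],   # category 1 in Condition 1, category 2 in Condition 2
        "c": counts[(False, True)],   # category 2 in Condition 1, category 1 in Condition 2
        "d": counts[(False, False)],  # Response category 2 in both conditions
    }


# Hypothetical paired responses, one tuple per subject (or matched pair).
pairs = [("yes", "yes"), ("yes", "no"), ("no", "yes"), ("no", "yes"), ("no", "no")]
cells = mcnemar_cells(pairs, category_1="yes")
print(cells)
```

Each subject contributes exactly one observation to the table, so the four cell counts always sum to n.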


The McNemar test is based on the following assumptions: a) The sample of n subjects 
has been randomly selected from the population it represents; b) Each of the n observations in 
the contingency table is independent of the other observations; c) The scores of subjects are in 
the form of a dichotomous categorical measure involving two mutually exclusive categories; 
and d) Most sources state that the McNemar test should not be employed with extremely 
small sample sizes. Although the chi-square distribution is generally employed to evaluate the 
McNemar test statistic, in actuality the latter distribution is used to provide an approximation 
of the exact sampling distribution which is, in fact, the binomial distribution. When the sample 
size is small, in the interest of accuracy, the exact binomial probability for the data should be 
computed. Sources do not agree on the minimum acceptable sample size for computing the 
McNemar test statistic (i.e., using the chi-square distribution). Some sources endorse the use 
of a correction for continuity with small sample sizes (discussed in Section VI), in order to ensure 
that the computed chi-square value provides a more accurate estimate of the exact binomial 
probability. 


II. Examples 


Since, as noted in Section I, the McNemar test is employed to evaluate a true experiment and 
a before-after design, two examples, each representing one of the aforementioned designs, will 
be presented in this section. Since the two examples employ identical data, they will result in the 
same conclusion with respect to the null hypothesis. Example 20.1 describes a true experiment 
and Example 20.2 describes a study that employs a before-after design. 


Example 20.1 A psychologist wants to compare a drug for treating enuresis (bed-wetting) with 
a placebo. One hundred enuretic children are administered both the drug (Endurin) and a 
placebo in a double-blind study conducted over a six-month period. During the study, each 
child has six drug and six placebo treatments, with each treatment lasting one week. 
To ensure that there are no carryover effects from one treatment to another, during the week 
following each treatment a child is not given either the drug or the placebo. The order of 
presentation of the 12 treatment periods for each child is randomly determined. The dependent 
variable in the study is a parent's judgement with respect to whether or not a child improves 
under each of the two experimental conditions. Table 20.2 summarizes the results of the study. 
Do the data indicate the drug was effective? 


Table 20.2. Summary of Data for Example 20.1 


                                    Favorable response to drug 
                                     Yes        No        Row sums 
Favorable response    Yes             10        13           23 
to placebo            No              41        36           77 
                      Column sums     51        49          100 


Note that the data in Table 20.2 indicate the following: a) 10 subjects respond favorably 
to both the drug and the placebo; b) 13 subjects do not respond favorably to the drug but do 
respond favorably to the placebo; c) 41 subjects respond favorably to the drug but do not respond 
favorably to the placebo; and d) 36 subjects do not respond favorably to either the drug or the 
placebo. Of the 100 subjects, 51 respond favorably to the drug, while 49 do not. 23 of the 100 
subjects respond favorably to the placebo, while 77 do not. 


Example 20.2 A researcher conducts a study to investigate whether or not a weekly television 
series that is highly critical of the use of animals as subjects in medical research influences 
public opinion. One hundred randomly selected subjects are administered a pretest to determine 
their attitude concerning the use of animals in medical research. Based on their responses, sub- 
jects are categorized as pro-animal research or anti-animal research. Following the pretest, all 
of the subjects are instructed to watch the television series (which lasts two months). At the 
conclusion of the series each subject’s attitude toward animal research is reassessed. The results 
of the study are summarized in Table 20.3. Do the data indicate that a shift in attitude toward 
animal research occurred after subjects viewed the television series? 


Table 20.3 Summary of Data for Example 20.2 


                                   Posttest 
                              Anti        Pro        Row sums 
Pretest    Anti                10         13            23 
           Pro                 41         36            77 
           Column sums         51         49           100 


Note that the data in Table 20.3 indicate the following: a) 10 subjects express an anti- 
animal research attitude on both the pretest and the posttest; b) 13 subjects express an anti-animal 
research attitude on the pretest but a pro-animal research attitude on the posttest; c) 41 subjects 
express a pro-animal research attitude on the pretest but an anti-animal research attitude on the 
posttest; and d) 36 subjects express a pro-animal research attitude on both the pretest and the 
posttest. Of the 100 subjects, 23 are anti on the pretest and 77 are pro on the pretest. Of the 100 
subjects, 51 are anti on the posttest, while 49 are pro on the posttest. 

Table 20.4 summarizes the data for Examples 20.1 and 20.2. 


Table 20.4 Summary of Data for Examples 20.1 and 20.2 


                                      Favorable response to drug/Posttest 
                                      Yes/Anti      No/Pro      Row sums 
Favorable response    Yes/Anti         a = 10       b = 13         23 
to placebo/Pretest    No/Pro           c = 41       d = 36         77 
                      Column sums        51           49          100 


III. Null versus Alternative Hypotheses 


In conducting the McNemar test, the cells of interest in Table 20.4 are Cells b and c, since the 
latter two cells represent those subjects who respond in different response categories under the 
two experimental conditions (in the case of a true experiment) or in the pretest versus posttest 
(in the case of a before-after design). In Example 20.1, the frequencies recorded in Cells b and 
c, respectively, represent subjects who respond favorably to the placebo/unfavorably to the 
drug and favorably to the drug/unfavorably to the placebo. If the drug is more effective than 
the placebo, one would expect the proportion of subjects in Cell c to be larger than the proportion 
of subjects in Cell b. In Example 20.2, the frequencies recorded in Cells b and c, respectively, 
represent subjects who are anti-animal research in the pretest/pro-animal research in the 
posttest and pro-animal research in the pretest/anti-animal research in the posttest. If there 
is a shift in attitude from the pretest to the posttest (specifically from pro-animal research to 
anti-animal research), one would expect the proportion of subjects in Cell c to be larger than 
the proportion of subjects in Cell b. 

It will be assumed that in the underlying population, $\pi_b$ and $\pi_c$ represent the following 
proportions: $\pi_b = b/(b + c)$ and $\pi_c = c/(b + c)$. If there is no difference between the two 
experimental conditions (in the case of a true experiment) or between the pretest and the posttest 
(in the case of a before-after design), the following will be true: $\pi_b = \pi_c = .5$. With respect 
to the sample data, the values $\pi_b$ and $\pi_c$ are estimated with the values $p_b$ and $p_c$, which in the 
case of Examples 20.1 and 20.2 are $p_b = b/(b + c) = 13/(13 + 41) = .24$ and 
$p_c = c/(b + c) = 41/(13 + 41) = .76$. 
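
As a quick check of the arithmetic above, the two sample proportions can be computed directly from the Cell b and Cell c frequencies of Table 20.4. The variable names below are illustrative only.

```python
# Frequencies of the two discordant cells in Table 20.4.
b, c = 13, 41

# Sample estimates of the population proportions for Cells b and c.
p_b = b / (b + c)
p_c = c / (b + c)

print(round(p_b, 2), round(p_c, 2))  # 0.24 0.76
```

Because the two proportions are computed over the same b + c observations, they necessarily sum to 1.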

Employing the above information, the null and alternative hypotheses for the McNemar test 
can now be stated.² 

Null hypothesis $H_0$: $\pi_b = \pi_c$ 


(In the underlying population the sample represents, the proportion of observations in Cell b 
equals the proportion of observations in Cell c.) 


Alternative hypothesis $H_1$: $\pi_b \neq \pi_c$ 


(In the underlying population the sample represents, the proportion of observations in Cell b does 
not equal the proportion of observations in Cell c. This is a nondirectional alternative hy- 
pothesis and it is evaluated with a two-tailed test. In order to be supported, the proportion of 
observations in Cell b ($p_b$) can be either significantly larger or significantly smaller than the 
proportion of observations in Cell c ($p_c$). In the case of Example 20.1, this alternative hypothesis 
will be supported if the proportion of subjects who respond favorably to the placebo/ 
unfavorably to the drug is significantly greater than the proportion of subjects who respond 
favorably to the drug/unfavorably to the placebo, or the proportion of subjects who respond 
favorably to the drug/unfavorably to the placebo is significantly greater than the proportion 
of subjects who respond favorably to the placebo/unfavorably to the drug. In the case of 
Example 20.2, this alternative hypothesis will be supported if, in the pretest versus posttest, a 
significantly larger proportion of subjects shift their response from pro-animal research to anti- 
animal research or a significantly larger proportion of subjects shift their response from anti- 
animal research to pro-animal research.) 


Or 
$H_1$: $\pi_b > \pi_c$ 


(In the underlying population the sample represents, the proportion of observations in Cell b is 
greater than the proportion of observations in Cell c. This is a directional alternative hypothesis 
and it is evaluated with a one-tailed test. In order to be supported, the proportion of 
observations in Cell b ($p_b$) must be significantly larger than the proportion of observations in 
Cell c ($p_c$). In the case of Example 20.1, this alternative hypothesis will be supported if the 
proportion of subjects who respond favorably to the placebo/unfavorably to the drug is 
significantly greater than the proportion of subjects who respond favorably to the 
drug/unfavorably to the placebo. In the case of Example 20.2, this alternative hypothesis will 
be supported if, in the pretest versus posttest, a significantly larger proportion of subjects shift 
their response from anti-animal research to pro-animal research.) 


Or 
$H_1$: $\pi_b < \pi_c$ 


(In the underlying population the sample represents, the proportion of observations in Cell b is 
less than the proportion of observations in Cell c. This is a directional alternative hypothesis 
and it is evaluated with a one-tailed test. In order to be supported, the proportion of observations 
in Cell b ($p_b$) must be significantly smaller than the proportion of observations in Cell c ($p_c$). 
In the case of Example 20.1, this alternative hypothesis will be supported if the proportion of 
subjects who respond favorably to the drug/unfavorably to the placebo is significantly greater 
than the proportion of subjects who respond favorably to the placebo/unfavorably to the drug. 
In the case of Example 20.2, this alternative hypothesis will be supported if, in the pretest versus 
posttest, a significantly larger proportion of subjects shift their responses from pro-animal 
research to anti-animal research.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 


IV. Test Computations 

The test statistic for the McNemar test, which is based on the chi-square distribution, is computed 
with Equation 20.1.³ 

$$\chi^2 = \frac{(b - c)^2}{b + c} \qquad \text{(Equation 20.1)}$$


Where: b and c represent the number of observations in Cells b and c of the McNemar test 
summary table 

Substituting the appropriate values in Equation 20.1, the value $\chi^2 = 14.52$ is computed for 
Examples 20.1/20.2. 

$$\chi^2 = \frac{(13 - 41)^2}{13 + 41} = 14.52$$

The computed chi-square value can never be a negative number. If a negative value is 
obtained, it indicates that an error has been made. The only time the value of chi-square will 
equal zero is when b = c. 
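
Equation 20.1 translates directly into code. The sketch below is illustrative only: the function name `mcnemar_chi_square` and the guard against an empty pair of cells are assumptions added here, not part of the procedure described in the text.

```python
def mcnemar_chi_square(b, c):
    """McNemar chi-square statistic (Equation 20.1) from the Cell b and Cell c frequencies."""
    if b + c == 0:
        # Assumption of this sketch: with no discordant observations the statistic is undefined.
        raise ValueError("b + c must be greater than zero")
    return (b - c) ** 2 / (b + c)


# Cells b and c from Table 20.4 (Examples 20.1/20.2).
chi_square = mcnemar_chi_square(13, 41)
print(round(chi_square, 2))  # 14.52
```

Since the numerator is a squared difference, the function can only return zero (when b = c) or a positive value, in line with the remark above.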


V. Interpretation of the Test Results 


The obtained value $\chi^2 = 14.52$ is evaluated with Table A4 (Table of the Chi-Square Distribution) 
in the Appendix.⁴ The degrees of freedom employed in the analysis are df = 1.⁵ 
Employing Table A4, for df = 1 the tabled critical two-tailed .05 and .01 chi-square values are 
$\chi^2_{.05} = 3.84$ (which corresponds to the chi-square value at the 95th percentile) and $\chi^2_{.01} = 6.63$ 
(which corresponds to the chi-square value at the 99th percentile). The tabled critical one-tailed 
.05 and .01 values are $\chi^2_{.05} = 2.71$ (which corresponds to the chi-square value at the 90th 
percentile) and $\chi^2_{.01} = 5.43$ (which corresponds to the chi-square value at the 98th percentile). 

The following guidelines are employed in evaluating the null hypothesis for the McNemar 
test. 

a) If the nondirectional alternative hypothesis $H_1$: $\pi_b \neq \pi_c$ is employed, the null hypothesis 
can be rejected if the obtained chi-square value is equal to or greater than the tabled critical 
two-tailed value at the prespecified level of significance. 

b) If a directional alternative hypothesis is employed, only the directional alternative 
hypothesis that is consistent with the data can be supported. With respect to the latter alternative 
hypothesis, the null hypothesis can be rejected if the obtained chi-square value is equal to or 
greater than the tabled critical one-tailed value at the prespecified level of significance. 

Applying the above guidelines to Examples 20.1/20.2, we can conclude the following. 

Since the obtained value $\chi^2 = 14.52$ is greater than the tabled critical two-tailed values 
$\chi^2_{.05} = 3.84$ and $\chi^2_{.01} = 6.63$, the nondirectional alternative hypothesis $H_1$: $\pi_b \neq \pi_c$ is supported 
at both the .05 and .01 levels. Since $\chi^2 = 14.52$ is greater than the tabled critical one-tailed 
values $\chi^2_{.05} = 2.71$ and $\chi^2_{.01} = 5.43$, the directional alternative hypothesis $H_1$: $\pi_b < \pi_c$ is 
supported at both the .05 and .01 levels (since $p_b = .24$ is less than $p_c = .76$). 
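
In place of a table lookup, the probability associated with the obtained chi-square value can be computed. The following is a minimal sketch that relies on the fact that, for df = 1, the upper-tail chi-square probability equals erfc(sqrt(chi-square / 2)); the function name is hypothetical, and halving the two-tailed probability to obtain a one-tailed value is an assumption that mirrors the use of the 90th and 98th percentiles as one-tailed critical values above.

```python
import math


def chi_square_df1_p_value(chi_square):
    """Upper-tail probability for a chi-square value with df = 1."""
    return math.erfc(math.sqrt(chi_square / 2.0))


b, c = 13, 41
chi_square = (b - c) ** 2 / (b + c)  # Equation 20.1

two_tailed_p = chi_square_df1_p_value(chi_square)
# Only meaningful for the directional hypothesis that is consistent with the data (here c > b).
one_tailed_p = two_tailed_p / 2.0

print(two_tailed_p < 0.01, one_tailed_p < 0.01)  # True True
```

Both probabilities fall well below .01, which agrees with the conclusions reached with the tabled critical values.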

A summary of the analysis of Examples 20.1 and 20.2 with the McNemar test follows: 

Example 20.1: It can be concluded that the proportion of subjects who respond favorably 
to the drug is significantly greater than the proportion of subjects who respond favorably to the 
placebo. 

Example 20.2: It can be concluded that following exposure to the television series, there 
is a significant change in attitude toward the use of animals as subjects in medical research. The 
direction of the change is from pro-animal research to anti-animal research. It is important to 
note, however, that since Example 20.2 is based on a before-after design, the researcher is not 
justified in concluding that the change in attitude is a direct result of subjects watching the 
television series. This is the case because (as noted in Section VII of the t test for two dependent 
samples) a before-after design is an incomplete experimental design. Specifically, in order 
to be an adequately controlled experimental design, a before-after design requires the addition 
of a control group that is administered the identical pretest and posttest at the same time periods 
as the group described in Example 20.2. The control group, however, would not be exposed to 
the television series between the pretest and the posttest. Without inclusion of such a control 
group, it is not possible to determine whether an observed change in attitude from the pretest to 
the posttest is due to the experimental treatment (i.e., the television series), or is the result of one 
or more extraneous variables that may also have been present during the intervening time period 
between the pretest and the posttest. 


VI. Additional Analytical Procedures for the McNemar Test 
and/or Related Tests 


1. Alternative equation for the McNemar test statistic based on the normal distribution 
Equation 20.2 is an alternative equation that can be employed to compute the McNemar test 
statistic. It yields a result that is equivalent to that obtained with Equation 20.1. 


$$z = \frac{b - c}{\sqrt{b + c}} \qquad \text{(Equation 20.2)}$$

The sign of the computed z value is only relevant insofar as it indicates the directional 
alternative hypothesis with which the data are consistent. Specifically, the z value computed with 
Equation 20.2 will be a positive number if the number of observations in Cell b is greater than 
the number of observations in Cell c, and it will be a negative number if the number of 
observations in Cell c is greater than the number of observations in Cell b. Since in Examples 
20.1/20.2 c > b, the computed value of z will be a negative number. Substituting the appropriate 
values in Equation 20.2, the value $z = -3.81$ is computed. 

$$z = \frac{13 - 41}{\sqrt{13 + 41}} = -3.81$$


The square of the z value obtained with Equation 20.2 will always equal the chi-square 
value computed with Equation 20.1. This relationship can be confirmed by the fact that 
$(z = -3.81)^2 = (\chi^2 = 14.52)$. It is also the case that the square of a tabled critical z value at 
a given level of significance will equal the tabled critical chi-square value at the corresponding 
level of significance. 
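
The relationship between Equations 20.1 and 20.2 can be verified with a few lines of code. This is an illustrative sketch only; the function name `mcnemar_z` is an assumption introduced here.

```python
import math


def mcnemar_z(b, c):
    """McNemar test statistic based on the normal distribution (Equation 20.2)."""
    return (b - c) / math.sqrt(b + c)


b, c = 13, 41
z = mcnemar_z(b, c)
chi_square = (b - c) ** 2 / (b + c)  # Equation 20.1

print(round(z, 2))                    # -3.81
print(math.isclose(z ** 2, chi_square))  # True
```

The negative sign of z reflects the fact that the Cell c frequency exceeds the Cell b frequency; squaring z discards that directional information, which is why the chi-square form carries no sign.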

The obtained z value is evaluated with Table A1 (Table of the Normal Distribution) in 
the Appendix. In Table A1 the tabled critical two-tailed .05 and .01 values are $z_{.05} = 1.96$ 
and $z_{.01} = 2.58$, and the tabled critical one-tailed .05 and .01 values are $z_{.05} = 1.65$ and 
$z_{.01} = 2.33$. In interpreting the z value computed with Equation 20.2, the following guidelines 
are employed. 

a) If the nondirectional alternative hypothesis $H_1$: $\pi_b \neq \pi_c$ is employed, the null hypothesis 
can be rejected if the obtained absolute value of z is equal to or greater than the tabled 
critical two-tailed value at the prespecified level of significance. 

b) If a directional alternative hypothesis is employed, only the directional alternative hy- 
pothesis that is consistent with the data can be supported. With respect to the latter alternative 
hypothesis, the null hypothesis can be rejected if the obtained absolute value of z is equal to or 
greater than the tabled critical one-tailed value at the prespecified level of significance. 

Employing the above guidelines with Examples 20.1/20.2, we can conclude the following. 

Since the obtained absolute value z = 3.81 is greater than the tabled critical two-tailed 
values $z_{.05} = 1.96$ and $z_{.01} = 2.58$, the nondirectional alternative hypothesis $H_1$: $\pi_b \neq \pi_c$ 
is supported at both the .05 and .01 levels. Since the obtained absolute value z = 3.81 is greater 
than the tabled critical one-tailed values $z_{.05} = 1.65$ and $z_{.01} = 2.33$, the directional alternative 
hypothesis $H_1$: $\pi_b < \pi_c$ is supported at both the .05 and .01 levels. These conclusions are 
identical to those reached when Equation 20.1 is employed to evaluate the same set of data. 

2. The correction for continuity for the McNemar test Since the McNemar test employs 
a continuous distribution to approximate a discrete probability distribution, some sources recommend 
that a correction for continuity be employed in computing the test statistic. Sources that 
recommend such a correction recommend either that it be limited to small sample sizes or that it be 
used in all instances.⁷ Equations 20.3 and 20.4 are the continuity-corrected versions of 
Equations 20.1 and 20.2.⁸ 


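$$\chi^2 = \frac{(|b - c| - 1)^2}{b + c} \qquad \text{(Equation 20.3)}$$

$$z = \frac{|b - c| - 1}{\sqrt{b + c}} \qquad \text{(Equation 20.4)}$$

Substituting the appropriate values in Equations 20.3 and 20.4, the values $\chi^2 = 13.5$ and 
$z = 3.67$ are computed for Examples 20.1/20.2. 

$$\chi^2 = \frac{(|13 - 41| - 1)^2}{13 + 41} = 13.5$$

$$z = \frac{|13 - 41| - 1}{\sqrt{13 + 41}} = 3.67$$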

As is the case without the continuity correction, the square of the z value obtained with 
Equation 20.4 will always equal the chi-square value computed with Equation 20.3. This 
relationship can be confirmed by the fact that $(z = 3.67)^2 = (\chi^2 = 13.5)$. Note that the chi-square 
value computed with Equation 20.3 will always be less than the value computed with 
Equation 20.1. In the same respect, the absolute value of z computed with Equation 20.4 will 
always be less than the absolute value of z computed with Equation 20.2. The lower absolute 
values computed for the continuity-corrected statistics reflect the fact that the latter analysis 
provides a more conservative test of the null hypothesis than does the uncorrected analysis. In 
this instance, the decision the researcher makes with respect to the null hypothesis is not affected 
by the correction for continuity, since the values $\chi^2 = 13.5$ and $z = 3.67$ are both greater than 
the relevant tabled critical one- and two-tailed .05 and .01 values. Thus, the nondirectional 
alternative hypothesis $H_1$: $\pi_b \neq \pi_c$ and the directional alternative hypothesis $H_1$: $\pi_b < \pi_c$ 
are supported at both the .05 and .01 levels. 
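
The continuity-corrected statistics can be sketched in the same way as the uncorrected ones. The function name `mcnemar_corrected` is hypothetical; the two expressions inside it follow Equations 20.3 and 20.4.

```python
import math


def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar statistics (Equations 20.3 and 20.4)."""
    corrected_difference = abs(b - c) - 1
    chi_square = corrected_difference ** 2 / (b + c)
    z = corrected_difference / math.sqrt(b + c)
    return chi_square, z


chi_square_c, z_c = mcnemar_corrected(13, 41)
print(round(chi_square_c, 1), round(z_c, 2))  # 13.5 3.67

# For comparison, the uncorrected chi-square value from Equation 20.1.
chi_square_uncorrected = (13 - 41) ** 2 / (13 + 41)
print(chi_square_c < chi_square_uncorrected)  # True
```

Because Equation 20.4 works with the absolute difference, the corrected z value is reported without a sign; the direction of the difference must be read from the cell frequencies themselves.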




3. Computation of the exact binomial probability for the McNemar test model with a small 
sample size In Section I it is noted that the exact probability distribution for the McNemar test 
model is the binomial distribution, and that the chi-square distribution is employed to 
approximate the latter distribution. Although for large sample sizes the chi-square distribution 
provides an excellent approximation of the binomial distribution, many sources recommend that 
for small sample sizes the exact binomial probabilities be computed. In order to demonstrate the 
computation of an exact binomial probability for the McNemar test model, assume that Table 
20.5 is a revised summary table for Examples 20.1/20.2. 

Note that although the frequencies for Cells a and d in Table 20.5 are identical to those 
employed in Table 20.4, different frequencies are employed for Cells b and c. Although in 
Table 20.5 the total sample size of n = 56 is reasonably large, the total number of subjects in 
Cells b and c is quite small, and, in the final analysis, it is when the sum of the frequencies of the 
latter two cells is small that computation of the exact binomial probability is recommended. 


Table 20.5 Revised Summary Table for Examples 20.1 and 20.2 for Binomial Analysis 


                                         Favorable response to drug/Posttest 
                                         Yes/Anti      No/Pro      Row sums 
Favorable response to    Yes/Anti         a = 10       b = 2          12 
placebo/Pretest          No/Pro           c = 8        d = 36         44 
                         Column sums        18           38           56 


The fact that the frequencies of Cells a and d are not taken into account represents an obvious 
limitation of the McNemar test. In point of fact, the frequencies of Cells a and d could be 0 and 
0 instead of 10 and 36, and the same result will be obtained when the McNemar test statistic is 
computed. In the same respect, frequencies of 1000 and 3600 for Cells a and d will also yield 
the identical result. Common sense suggests, however, that the difference |b - c| = |2 - 8| = 6 
will be considered more important if the total sample size is small (which will be the case if the 
frequencies of Cells a and d are 0 and 0) than if the total sample size is very large (which will 
be the case if the frequencies of Cells a and d are 1000 and 3600). What the latter translates into 
is that a significant difference between Cells b and c may be of little or no practical significance, 
if the total number of observations in all four cells is very large. 

Employing Equations 20.1 and 20.2 with the data in Table 20.5, the values $\chi^2 = 3.6$ and 
$z = -1.90$ are computed for the McNemar test statistic. 

$$\chi^2 = \frac{(2 - 8)^2}{2 + 8} = 3.6$$

$$z = \frac{2 - 8}{\sqrt{2 + 8}} = -1.90$$





Employing Equations 20.3 and 20.4, the values $\chi^2 = 2.5$ and $z = 1.58$ are the continuity-corrected 
values computed for the McNemar test statistic. 

$$\chi^2 = \frac{(|2 - 8| - 1)^2}{2 + 8} = 2.5$$

$$z = \frac{|2 - 8| - 1}{\sqrt{2 + 8}} = 1.58$$


Employing Table A1, we determine that the exact one-tailed probability for the value
z = 1.90 computed with Equation 20.2 (as well as for χ² = 3.6 computed with Equation 20.1)
is .0287. We also determine that the exact one-tailed probability for the value z = 1.58
computed with Equation 20.4 (as well as for χ² = 2.5 computed with Equation 20.3) is .0571.⁹
Note that since the continuity correction results in a more conservative test, the probability
associated with the continuity-corrected value will always be higher than the probability
associated with the uncorrected value. Without the continuity correction, the directional alternative
hypothesis H₁: π_b < π_c is supported at the .05 level, since χ² = 3.6/z = 1.90 are greater
than the tabled critical one-tailed values χ²_.05 = 2.71/z_.05 = 1.65. The nondirectional
alternative hypothesis H₁: π_b ≠ π_c is not supported, since χ² = 3.6/z = 1.90 are less than the
tabled critical two-tailed values χ²_.05 = 3.84/z_.05 = 1.96. When the continuity correction is
employed, the directional alternative hypothesis H₁: π_b < π_c fails to achieve significance at the
.05 level, since χ² = 2.5/z = 1.58 are less than the tabled critical one-tailed values
χ²_.05 = 2.71/z_.05 = 1.65. The nondirectional alternative hypothesis H₁: π_b ≠ π_c is not
supported, since χ² = 2.5/z = 1.58 are less than the tabled critical two-tailed values
χ²_.05 = 3.84/z_.05 = 1.96.
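
The computations above can be checked with a short script. The following is a minimal sketch, not a definitive implementation: the function names `mcnemar_statistics` and `upper_tail_normal` are hypothetical, and the forms used for Equations 20.1 through 20.4 are the standard McNemar forms, assumed here because they reproduce the values reported in the text. The one-tailed probabilities are evaluated at the rounded z values, as is done when Table A1 is consulted.

```python
import math

def mcnemar_statistics(b, c):
    """Return (chi2, z, chi2_cc, z_cc) for the discordant cell counts b and c.

    Only Cells b and c enter the computation, so any frequencies in
    Cells a and d give the same result.
    """
    diff = b - c
    total = b + c
    chi2 = diff ** 2 / total                      # assumed form of Equation 20.1
    z = diff / math.sqrt(total)                   # assumed form of Equation 20.2
    chi2_cc = (abs(diff) - 1) ** 2 / total        # assumed form of Equation 20.3
    z_cc = (abs(diff) - 1) / math.sqrt(total)     # assumed form of Equation 20.4
    return chi2, z, chi2_cc, z_cc

def upper_tail_normal(z):
    """Proportion of the standard normal distribution falling above |z|."""
    return 0.5 * math.erfc(abs(z) / math.sqrt(2))

# Data from Table 20.5: b = 2, c = 8
chi2, z, chi2_cc, z_cc = mcnemar_statistics(b=2, c=8)
print(round(chi2, 2), round(abs(z), 2))        # uncorrected values
print(round(chi2_cc, 2), round(z_cc, 2))       # continuity-corrected values
print(round(upper_tail_normal(1.90), 4))       # one-tailed probability for z = 1.90
print(round(upper_tail_normal(1.58), 4))       # one-tailed probability for z = 1.58
```

Because the function accepts only b and c, it also makes concrete the limitation noted earlier: the total sample size has no influence on the test statistic.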

At this point, the exact binomial probability will be computed for the same set of data. As 
is the case with the equations for the McNemar test that are based on the chi-square and normal 
distributions, the binomial analysis only considers the frequencies of Cells b and c. Since only 
two cells are taken into account, the binomial analysis becomes identical to the analysis described 
for the binomial sign test for a single sample (Test 9).¹⁰

Equation 20.5 is the binomial equation that is employed to determine the likelihood of
obtaining a frequency of 8 or larger in one of the two cells (or 2 or less in one of the two cells)
in the McNemar model summary table, if the total frequency in the two cells is 10, where
m = b + c = 2 + 8 = 10. Note that Equation 20.5 is identical to Equation 9.5, except for the fact
that π_b and π_c are used in place of π₁ and π₂, and the value m, which represents b + c, is used
in place of n.

\[ P(\geq x) = \sum_{r=x}^{m} \binom{m}{r} (\pi_b)^r (\pi_c)^{m-r} \qquad \text{(Equation 20.5)} \]


In evaluating the data in Table 20.5, the following values are employed in Equation 20.5:
π_b = π_c = .5 (which will be the case if the null hypothesis is true), m = 10, x = 8.

\[ P(x \geq 8) = \binom{10}{8}(.5)^8(.5)^2 + \binom{10}{9}(.5)^9(.5)^1 + \binom{10}{10}(.5)^{10}(.5)^0 = .0547 \]


The computed probability .0547 is the likelihood of obtaining a frequency of 8 or greater 
in one of the two cells (as well as the likelihood of obtaining a frequency of 2 or less in one of 
the two cells). The value .0547 can also be obtained from Table A7 (Table of the Binomial 
Distribution, Cumulative Probabilities) in the Appendix. In using Table A7 we find the 
section for m = 10 (which is represented by n = 10 in the table), and locate the cell that is the
intersection of the row x = 8 and the column π = .5. The entry for the latter cell is .0547. The
value .0547 computed for the exact binomial probability is quite close to the continuity-corrected
probability of .0571 obtained with Equations 20.3/20.4 (which suggests that even when the sample
size is small, the continuity-corrected chi-square/normal approximation provides an excellent
estimate of the exact probability). As is the case when the data are evaluated with Equations
20.3/20.4, the directional alternative hypothesis H₁: π_b < π_c is not supported if the binomial
analysis is employed. This is the case, since the probability .0547 is greater than α = .05. In order
for the directional alternative hypothesis H₁: π_b < π_c to be supported, the tabled probability
must be equal to or less than α = .05. The nondirectional alternative hypothesis H₁: π_b ≠ π_c is
also not supported, since the probability .0547 is greater than α/2 = .05/2 = .025.¹¹
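
The exact binomial probability can also be obtained without Table A7. The sketch below sums the binomial terms for r = x through m, as described for Equation 20.5; the function name `binomial_upper_tail` is hypothetical, and the value .5 for both probabilities is the value specified by the null hypothesis.

```python
from math import comb

def binomial_upper_tail(x, m, p=0.5):
    """P(X >= x) when X is binomial with m trials and success probability p."""
    return sum(comb(m, r) * p ** r * (1 - p) ** (m - r) for r in range(x, m + 1))

# Data from Table 20.5: b = 2, c = 8, so m = b + c = 10 and x = 8
b, c = 2, 8
m = b + c
p_one_tailed = binomial_upper_tail(max(b, c), m)
print(round(p_one_tailed, 4))  # 0.0547

alpha = 0.05
print(p_one_tailed <= alpha)       # directional hypothesis: supported only if True
print(p_one_tailed <= alpha / 2)   # nondirectional hypothesis: supported only if True
```

Both comparisons print False, which matches the conclusion that neither alternative hypothesis is supported by the binomial analysis.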


4. Additional analytical procedures for the McNemar test 

a) A procedure for computing a confidence interval for the difference between the marginal 
probabilities (i.e., [(a + b)/n] — [(a + c)/n]) in a McNemar test summary table is described in 
Marascuilo and McSweeney (1977) and Fleiss (1981). 

b) Daniel (1990) and Fleiss (1981) provide references that discuss the power of the 
McNemar test relative to alternative procedures (such as the Gart test for order effects which 
is discussed in Section VII). Zar (1999, p. 171) describes power computations for the McNemar 
test. 

c) Fleiss (1981), who provides a detailed discussion of the McNemar test, notes that an
odds ratio (which is discussed in Section VI of the chi-square test for r x c tables (Test 16))
can be computed for the McNemar test summary table. Specifically, the odds ratio (o) is
computed with Equation 20.6. Employing the latter equation, the value o = 3.15 is computed
for Example 20.1.

\[ o = \frac{c}{b} = \frac{41}{13} = 3.15 \qquad \text{(Equation 20.6)} \]


In reference to Example 20.1, the computed value o = 3.15 indicates that the odds of a 
person responding to the drug are 3.15 times greater than the odds of a person responding to the 
placebo. 

d) Fleiss (1981) also notes that if the null hypothesis is rejected in a study such as that
described by Example 20.1, Equation 20.7 can be employed to determine the relative difference
(represented by the notation p_e) between the two treatments. Equation 20.7 is employed with
the data for Example 20.1 to compute the value p_e = .36.

\[ p_e = \frac{c - b}{c + d} = \frac{41 - 13}{41 + 36} = .36 \qquad \text{(Equation 20.7)} \]


The computed value .36 indicates that in a sample of 100 patients who do not respond 
favorably to the placebo, (.36)(100) = 36 would be expected to respond favorably to the drug. 
Fleiss (1981) describes the computation of the estimated standard error for the value of the 
relative difference computed with Equation 20.7, as well as the procedure for computing a 
confidence interval for the relative difference. 
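
As a check on the two values reported above, the sketch below computes the odds ratio and the relative difference for Example 20.1 (a = 10, b = 13, c = 41, d = 36). It assumes that Equation 20.6 is the ratio c/b and that Equation 20.7 is (c − b)/(c + d); these readings are assumptions, adopted because they reproduce the reported values 3.15 and .36. The function names are hypothetical.

```python
def odds_ratio(b, c):
    """Odds ratio for a McNemar summary table (assumed form: c / b)."""
    return c / b

def relative_difference(b, c, d):
    """Relative difference between treatments (assumed form: (c - b) / (c + d))."""
    return (c - b) / (c + d)

# Example 20.1: Cells b and c hold the inconsistent responders;
# Cells c and d together hold everyone who did not respond to the placebo.
a, b, c, d = 10, 13, 41, 36
o = odds_ratio(b, c)
p_e = relative_difference(b, c, d)
print(round(o, 2))    # 3.15
print(round(p_e, 2))  # 0.36
```

Under the assumed form, the denominator c + d = 77 is the number of subjects who do not respond favorably to the placebo, which is why p_e is read as the proportion of such subjects expected to respond to the drug.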

e) In Section IX (the Addendum) of the Pearson product-moment correlation coefficient
(Test 28), the use of the phi coefficient (Test 16g described in Section VI of the chi-square test
for r x c tables) as a measure of association for the McNemar test model is discussed within the
context of the tetrachoric correlation coefficient (Test 28j).


VII. Additional Discussion of the McNemar Test 


1. Alternative format for the McNemar test summary table and modified test equation 
Although in this book Cells b and c are designated as the two cells in which subjects are in- 
consistent with respect to their response categories, the McNemar test summary table can be 
rearranged so that Cells a and d become the relevant cells. Table 20.6 represents such a re- 
arrangement with respect to the response categories employed in Examples 20.1/20.2. 


Table 20.6 Alternative Format for Summary of Data for Examples 20.1 and 20.2

                                        Favorable response to drug/Posttest
                                        Yes/Anti      No/Pro      Row sums
Favorable response to      No/Pro       a = 41        b = 36         77
placebo/Pretest            Yes/Anti     c = 10        d = 13         23
                           Column sums    51            49          100


Note that in Table 20.6 Cells a and d are the key cells, since subjects who are inconsistent
with respect to response categories are represented in these two cells. If Table 20.6 is employed
as the summary table for the McNemar test, Cells a and d will, respectively, replace Cells c and
b in stating the null and alternative hypotheses. In addition, Equations 20.8 and 20.9 will,
respectively, be employed in place of Equations 20.1 and 20.2. When the appropriate values are
substituted in the aforementioned equations, Equation 20.8 yields the value χ² = 14.52 computed
with Equation 20.1, and Equation 20.9 yields the value z = −3.81 computed with Equation 20.2.


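\[ \chi^2 = \frac{(d - a)^2}{d + a} = \frac{(13 - 41)^2}{13 + 41} = 14.52 \qquad \text{(Equation 20.8)} \]

\[ z = \frac{d - a}{\sqrt{d + a}} = \frac{13 - 41}{\sqrt{13 + 41}} = -3.81 \qquad \text{(Equation 20.9)} \]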
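
The equivalence of the two table formats can be confirmed numerically. In the sketch below, the forms (d − a)²/(d + a) and (d − a)/√(d + a) for Equations 20.8 and 20.9 are assumptions that follow from letting Cells a and d replace Cells c and b, respectively, in the standard forms of Equations 20.1 and 20.2. The variable names are hypothetical.

```python
import math

# Original format (Example 20.1): the inconsistent subjects are in Cells b and c
b_orig, c_orig = 13, 41
chi2_orig = (b_orig - c_orig) ** 2 / (b_orig + c_orig)
z_orig = (b_orig - c_orig) / math.sqrt(b_orig + c_orig)

# Alternative format (Table 20.6): the inconsistent subjects are in Cells a and d
a_alt, d_alt = 41, 13
chi2_alt = (d_alt - a_alt) ** 2 / (d_alt + a_alt)
z_alt = (d_alt - a_alt) / math.sqrt(d_alt + a_alt)

print(round(chi2_orig, 2), round(chi2_alt, 2))  # 14.52 14.52
print(round(z_orig, 2), round(z_alt, 2))        # -3.81 -3.81
```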





2. Alternative nonparametric procedures for evaluating a design with two dependent 
samples involving categorical data Gart (1969) has developed a test for evaluating a design 
with two dependent samples involving categorical data that can determine whether the order of 
presentation of two treatments influences subjects' responses to the treatments. The Gart test 
for order effects is based on the use of the Fisher exact test (Test 16c) with two 2 x 2 con- 
tingency tables, which summarize the responses of subjects in relation to the differential 
treatments as well as the order of presentation of the treatments. The Gart test for order effects 
is described in Everitt (1977, pp. 22-26; 1992) and Zar (1999, pp. 173-175). The McNemar 
test model has been extended by Bowker (1948) (The Bowker test of symmetry (Test 20a)) 
and Stuart (1955, 1957) (The Stuart test) to a dependent samples design in which the dependent 
variable is a categorical measure that is comprised of more than two categories. The latter two 
tests are discussed in Section IX (the Addendum). 


VIII. Additional Examples Illustrating the Use of the McNemar Test 


Three additional examples that can be evaluated with the McNemar test are presented in this 
section. Since Examples 20.3-20.5 employ the same data employed in Example 20.1, they yield 
the identical result. 


Example 20.3 In order to determine if there is a relationship between schizophrenia and en- 
larged cerebral ventricles, a researcher evaluates 100 pairs of identical twins who are 
discordant with respect to schizophrenia (i.e., within each twin pair, only one member of the pair 
has schizophrenia). Each subject is evaluated with a CAT scan to determine whether or not there 
is enlargement of the ventricles. The results of the study are summarized in Table 20.7. Do the 
data indicate there is a statistical relationship between schizophrenia and enlarged ventricles? 


Table 20.7 Summary of Data for Example 20.3

                                            Schizophrenic twin
                                   Enlarged ventricles   Normal ventricles   Row sums
Normal twin   Enlarged ventricles          10                    13              23
              Normal ventricles            41                    36              77
              Column sums                  51                    49             100


Since Table 20.7 summarizes categorical data derived from n = 100 pairs of matched 
subjects, the McNemar test is employed to evaluate the data. The reader should take note of the 
fact that since the independent variable employed in Example 20.3 is nonmanipulated (spe- 
cifically, it is whether or not a subject is schizophrenic or normal), analysis of the data will only 
provide correlational information, and thus will not allow the researcher to draw conclusions with 
regard to cause and effect. In other words, although the study indicates that schizophrenic
subjects are significantly more likely than normal subjects to have enlarged ventricles, one cannot 
conclude that enlarged ventricles cause schizophrenia or that schizophrenia causes enlarged 
ventricles. Although either of the latter is possible, the design of the study only allows one to 
conclude that the presence of enlarged ventricles is associated with schizophrenia. 


Example 20.4 A company that manufactures an insecticide receives complaints from its em- 
ployees about premature hair loss. An air quality analysis reveals a large concentration of a 
vaporous compound emitted by the insecticide within the confines of the factory. In order to 
determine whether or not the vaporous compound (which is known as Acherton) is related to hair 
loss, the following study is conducted. Each of 100 mice is exposed to air containing high con- 
centrations of Acherton over a two-month period. The same mice are also exposed to air that 
is uncontaminated with Acherton during another two-month period. Half of the mice are initially 
exposed to the Acherton contaminated air followed by the uncontaminated air, while the other 
half are initially exposed to the uncontaminated air followed by the Acherton contaminated air. 
The dependent variable in the study is whether or not a mouse exhibits hair loss during an 
experimental condition. Table 20.8 summarizes the results of the study. Do the data indicate 
a relationship between Acherton and hair loss? 


Table 20.8 Summary of Data for Example 20.4

                                     Acherton contaminated air
                                     Hair loss     No hair loss     Row sums
Uncontaminated air   Hair loss          10              13             23
                     No hair loss       41              36             77
                     Column sums        51              49            100


Analysis of the data in Table 20.8 reveals that the mice are significantly more likely to 
exhibit hair loss when exposed to Acherton as opposed to when they are exposed to uncon- 
taminated air. Although the results of the study suggest that Acherton may be responsible for 
hair loss, one cannot assume that the results can be generalized to humans. 


Example 20.5 A market research firm is hired to determine whether or not a debate between 
the two candidates who are running for the office of Governor influences voter preference. The 
gubernatorial preference of 100 randomly selected voters is determined before and after a 
debate between the two candidates, Edgar Vega and Vera Myers. Table 20.9 summarizes the 
results of the voter preference survey. Do the data indicate that the debate influenced voter 
preference? 


Table 20.9 Summary of Data for Example 20.5

                                     Voter preference before debate
                                     Edgar Vega     Vera Myers     Row sums
Voter preference     Edgar Vega          10             13            23
after debate         Vera Myers          41             36            77
                     Column sums         51             49           100


When the data for Example 20.5 (which represents a before-after design) are evaluated
with the McNemar test, the result indicates that, following the debate, there is a significant shift
in voter preference in favor of Vera Myers. As noted in Section V, since a before-after design 
does not adequately control for the potential influence of extraneous variables, one cannot rule 
out the possibility that some factor other than the debate is responsible for the shift in voter 
preference. 


IX. Addendum 


Extension of the McNemar test model beyond 2 x 2 contingency tables The McNemar test 
model has been extended by Bowker (1948) and Stuart (1955, 1957) to a dependent samples 
design in which the dependent variable is a categorical measure that is comprised of more than 
two categories. In the test models for the Bowker test of symmetry (Test 20a) (which will be 
described in this section), and the Stuart test, a k x k (i.e., square) contingency table (where k 
is the number of response categories, and k ≥ 3) is employed to categorize n subjects (or n pairs
of matched subjects) on a dependent variable under the two conditions (or two time periods). 
The Bowker test evaluates differences with respect to the joint probability distributions, or to 
put it more simply, whether the data are distributed symmetrically about the main diagonal of 
the table. The Stuart test, on the other hand, evaluates differences between marginal prob- 
abilities. 


Test 20a: The Bowker test of symmetry Bowker (1948) has developed a test to evaluate 
whether or not the data in a k x k contingency table are distributed symmetrically about the main 
diagonal of the table. Note that in a k x k contingency table, k = r = c (i.e., the numbers of 
rows and columns are equal). A lack of symmetry in a k x k table is interpreted to mean that 
there is a difference in the distribution of the data under the two experimental conditions/time 
periods. In the case of a 2 x 2 table, the Bowker test becomes equivalent to the McNemar test.
Example 20.6 will be employed to illustrate the use of the Bowker test of symmetry. 


Example 20.6 Two drugs that are believed to have mood-altering effects are tested on 275 
subjects. The order of administration of the drugs is counterbalanced so that half the subjects 
receive Drug A followed by Drug B, while the reverse sequence of administration is used for the 
other subjects. Based on his response to each drug a subject is assigned to one of the following 
three response categories: No change in mood (NC); Moderate mood alteration (MA); Dramatic 
mood alteration (DA). Table 20.10 represents a joint distribution that summarizes the responses 
of the 275 subjects to both of the drugs. (The value in any cell in Table 20.10 represents the 
number of subjects whose response to Drug A corresponds to the column category for that cell, 
and whose response to Drug B corresponds to the row category for that cell.) Do the data 
indicate that the two drugs differ with respect to their mood-altering properties?


Table 20.10 Summary of Data for Example 20.6

                          Response to Drug A
                        NC        MA        DA      Row sums
Response to     NC      25        10        16         51
Drug B          MA      20        50        14         84
                DA      30        40        70        140
Column sums             75       100       100        275


The null and alternative hypotheses evaluated with the Bowker test of symmetry are as
follows.


Null hypothesis          H₀: p_ij = p_ji (where j > i)


(In the k x k contingency table, the probabilities are equal for each of the off-diagonal/symmetric
pairs. The latter translates into the fact that the distribution of data above the main diagonal of
the k x k contingency table is the same as the distribution of data below the main diagonal. With
respect to Example 20.6, the null hypothesis is stating that the response distributions of subjects
for the two drugs will be the same.)


Alternative hypothesis          H₁: p_ij ≠ p_ji for at least one cell (where j > i)


(In the k x k contingency table, the probabilities are not equal for at least one pair of the total
off-diagonal/symmetric pairs. The latter translates into the fact that the distribution of data above
the main diagonal of the k x k contingency table is not the same as the distribution of data below
the main diagonal. With respect to Example 20.6, the alternative hypothesis is stating that the
response distributions of subjects for the two drugs will not be the same. The alternative
hypothesis is nondirectional.)¹³


Equation 20.10 is employed to compute the test statistic for the Bowker test of symmetry.

\[ \chi^2 = \sum_{i=1}^{k} \sum_{j>i} \frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}} \qquad \text{(Equation 20.10)} \]

The notation in Equation 20.10 indicates the following: Each frequency above the main
diagonal that is in Row i and Column j (i.e., n_ij, where j > i) is paired with the frequency below
the main diagonal that is in Row j and Column i (i.e., n_ji, for which the column number is less
than the row number). The latter pair is referred to as an off-diagonal or symmetric pair. Within
each of the off-diagonal pairs the following is done: a) The difference between n_ij and n_ji is
obtained; b) The difference is squared; c) The squared difference is divided by the sum of n_ij and
n_ji; and d) All of the values computed in part c) are summed, and the resulting value represents
the test statistic, which is a chi-square value.

The number of off-diagonal pairs in a table will be equal to C(k, 2), which is the number of
combinations of k things taken two at a time (see Section IV of the binomial sign test for a
single sample for a clarification of combinations). The number of off-diagonal pairs is also
equal to the value (k x k − k)/2, which is equal to [k(k − 1)]/2 (which is the number of degrees
of freedom employed for the analysis). Thus, if k = 3 (i.e., r = c = 3), there will be three pairs,
since C(3, 2) = 3, or (3 x 3 − 3)/2 = [3(3 − 1)]/2 = 3. Specifically, the three pairs will involve the
following combinations of cell subscripts: 1,2; 1,3; and 2,3. Thus, the following pairs of cells
will be contrasted through use of Equation 20.10 (where the first digit represents the row (i) in
which the cell appears, and the second digit represents the column (j) in which the cell appears):
Cell₁₂ versus Cell₂₁; Cell₁₃ versus Cell₃₁; Cell₂₃ versus Cell₃₂. Note that for the first cell listed in
each pair, j > i, and for the second cell in each pair, j < i.

The data for Example 20.6 are evaluated below with Equation 20.10.

\[ \chi^2 = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}} + \frac{(n_{13} - n_{31})^2}{n_{13} + n_{31}} + \frac{(n_{23} - n_{32})^2}{n_{23} + n_{32}} \]

\[ \chi^2 = \frac{(10 - 20)^2}{10 + 20} + \frac{(16 - 30)^2}{16 + 30} + \frac{(14 - 40)^2}{14 + 40} = 20.11 \]
As noted earlier, the degrees of freedom employed for the Bowker test analysis are
[k(k − 1)]/2. Since k = 3, df = [3(3 − 1)]/2 = 3. Employing Table A4, the tabled critical .05
and .01 chi-square values for df = 3 are χ²_.05 = 7.81 and χ²_.01 = 11.34. In order to reject the
null hypothesis, the computed value of chi-square must be equal to or greater than the tabled
critical value at the prespecified level of significance. Since the computed value χ² = 20.11 is
greater than both of the aforementioned critical values, the null hypothesis can be rejected at both
the .05 and .01 levels. Inspection of Table 20.10 clearly suggests that Drug B is more likely to
be associated with a change in mood than Drug A.
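
The Bowker computation can be sketched in a few lines. In the example below the function name `bowker_statistic` is hypothetical. The off-diagonal frequencies are those used in the worked computation for Example 20.6; the diagonal entries (25, 50, 70) are an assumption, chosen so that the columns sum to 75, 100, and 100, and they play no part in the test statistic. Skipping a pair whose two frequencies sum to zero is a design choice made here to avoid division by zero, not something specified in the text.

```python
def bowker_statistic(table):
    """Return (chi2, df) for a k x k table given as a list of rows."""
    k = len(table)
    chi2 = 0.0
    for i in range(k):
        for j in range(i + 1, k):              # each off-diagonal pair once (j > i)
            n_ij, n_ji = table[i][j], table[j][i]
            if n_ij + n_ji > 0:                # skip empty pairs (assumption)
                chi2 += (n_ij - n_ji) ** 2 / (n_ij + n_ji)
    df = k * (k - 1) // 2                      # number of off-diagonal pairs
    return chi2, df

# Rows: response to Drug B (NC, MA, DA); columns: response to Drug A (NC, MA, DA)
table_20_10 = [
    [25, 10, 16],
    [20, 50, 14],
    [30, 40, 70],
]
chi2, df = bowker_statistic(table_20_10)
print(round(chi2, 2), df)  # 20.11 3

# Tabled critical values for df = 3 quoted in the text
print(chi2 >= 7.81, chi2 >= 11.34)  # True True
```

For a 2 x 2 table the loop visits a single pair, so the function returns the uncorrected McNemar chi-square value with df = 1.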

Further discussion of the Bowker test of symmetry can be found in Everitt (1977, pp. 
114-115; 1992), Marascuilo and McSweeney (1977), Marascuilo and Serlin (1988), Sprent 
(1993) and Zar (1999). 

Everitt (1977, 1992) notes that if on the basis of the Bowker test the hypothesis of 
symmetry (which will generally be the hypothesis of primary interest) is rejected, a researcher 
may employ the Stuart test to further clarify the distribution of the data. The Stuart test evaluates
the following null hypothesis: H₀: p_i. = p_.j. The latter indicates that all of the corresponding
symmetric marginal probabilities are equal (i.e., the probability for the i-th row will be
equal to the probability for the j-th column, with the requirement that i = j). Sources that describe
the Stuart test are Everitt (1977, pp. 115-116; 1992), Fleiss (1981), Hinkle et al. (1998), 
Marascuilo and McSweeney (1977), and Marascuilo and Serlin (1988). 
Endnotes


1. The distinction between a true experiment and a natural experiment is discussed in more 
detail in the Introduction of the book. 


2. The reader should take note of the following with respect to the null and alternative
hypotheses stated in this section:

a) If n represents the total number of observations in Cells a, b, c, and d, the
proportion of observations in Cells b and c can also be expressed as follows: b/n and
c/n. The latter two values, however, are not equivalent to the values p_b and p_c that are
used to estimate the values π_b and π_c employed in the null and alternative hypotheses.

b) Many sources employ an alternative but equivalent way of stating the null and
alternative hypotheses for the McNemar test. Assume that π₁ represents the proportion
of observations in the underlying population who respond in Response category 1 in
Condition 1/Pretest, and π₂ represents the proportion of observations in the underlying
population who respond in Response category 1 in Condition 2/Posttest. With respect
to the sample data, the values p₁ and p₂ are employed to estimate π₁ and π₂, where
p₁ = (a + b)/n and p₂ = (a + c)/n. In the case of Examples 20.1 and 20.2,
p₁ = (10 + 13)/100 = .23 and p₂ = (10 + 41)/100 = .51. If there is no difference in the
proportion of observations in Response category 1 in Condition 1/Pretest versus the
proportion of observations in Response category 1 in Condition 2/Posttest, p₁ and p₂
would be expected to be equal, and if the latter is true one can conclude that in the
underlying population π₁ = π₂. If, however, p₁ ≠ p₂ (and consequently in the underlying
population π₁ ≠ π₂), it indicates a difference between the two experimental conditions in
the case of a true experiment, and a difference between the pretest and the posttest
responses of subjects in the case of a before-after design. Employing this information,
the null hypothesis can be stated as follows: H₀: π₁ = π₂. The null hypothesis
H₀: π₁ = π₂ is equivalent to the null hypothesis H₀: π_b = π_c. The nondirectional
alternative hypothesis can be stated as H₁: π₁ ≠ π₂. The nondirectional alternative
hypothesis H₁: π₁ ≠ π₂ is equivalent to the nondirectional alternative hypothesis
H₁: π_b ≠ π_c. The two directional alternative hypotheses that can be employed are
H₁: π₁ > π₂ or H₁: π₁ < π₂. The directional alternative hypothesis H₁: π₁ > π₂ is
equivalent to the directional alternative hypothesis H₁: π_b > π_c. The directional
alternative hypothesis H₁: π₁ < π₂ is equivalent to the directional alternative hypothesis
H₁: π_b < π_c.


3. It can be demonstrated algebraically that Equation 20.1 is equivalent to Equation 8.2 (which
is the equation for the chi-square goodness-of-fit test (Test 8)). Specifically, if Cells a
and d are eliminated from the analysis, and the chi-square goodness-of-fit test is employed
to evaluate the observations in Cells b and c, n = b + c. If the expected probability for 
each of the cells is .5, Equation 8.2 reduces to Equation 20.1. As will be noted in Section 
VI, a limitation of the McNemar test (which is apparent from inspection of Equation 20.1) 
is that it only employs the data for two of the four cells in the contingency table. 
4. A general overview of the chi-square distribution and interpretation of the values listed in
Table A4 can be found in Sections I and V of the single-sample chi-square test for a
population variance (Test 3).


5. The degrees of freedom are based on Equation 8.3, which is employed to compute the
degrees of freedom for the chi-square goodness-of-fit test. In the case of the McNemar
test, df = k − 1 = 2 − 1 = 1, since only the observations in Cells b and c (i.e., k = 2 cells) are
evaluated.


6. A full discussion of the protocol for determining one-tailed chi-square values can be found
in Section VII of the chi-square goodness-of-fit test.


7. A general discussion of the correction for continuity can be found under the Wilcoxon
signed-ranks test (Test 6). Fleiss (1981) notes that the correction for continuity for the
McNemar test was recommended by Edwards (1948).


8. The numerator of Equation 20.4 is sometimes written as (b − c) ± 1. In using the latter
format, 1 is added to the numerator if the term (b − c) results in a negative value, and 1
is subtracted from the numerator if the term (b − c) results in a positive value. Since we
are only interested in the absolute value of z, it is simpler to employ the numerator in
Equation 20.4, which results in the same absolute value that is obtained when the
alternative form of the numerator is employed. If the alternative form of the numerator is
employed for Examples 20.1/20.2, it yields the value z = −3.67.


9. The values .0287 and .0571, respectively, represent the proportion of the normal
distribution that falls above the values z = 1.90 and z = 1.58.


10. In point of fact, it can also be viewed as identical to the analysis conducted with the
binomial sign test for two dependent samples. In Section VII of the Cochran Q test, it
is demonstrated that when the McNemar test (as well as the Cochran Q test when k = 2)
and the binomial sign test for two dependent samples are employed to evaluate the same
set of data, they yield equivalent results.


11. For a comprehensive discussion on the computation of binomial probabilities and the use
of Table A7, the reader should review Section IV of the binomial sign test for a single
sample.


12. In some sources the Stuart test is referred to as the Stuart-Maxwell test based on the
contribution of Maxwell (1970).


13. Marascuilo and McSweeney (1977) note that it is only possible to state the alternative
hypothesis directionally when the number of degrees of freedom employed for the test is
1, which will always be the case for a 2 x 2 table.
Inferential Statistical Tests Employed with
Two or More Independent Samples 
(and Related Measures of 
Association/Correlation) 


Test 21: The Single-Factor Between-Subjects 
Analysis of Variance 


Test 22: The Kruskal-Wallis One-Way Analysis 
of Variance by Ranks 


Test 23: The van der Waerden Normal-Scores Test 
for k Independent Samples 
Test 21


The Single-Factor Between-Subjects Analysis of Variance
(Parametric Test Employed with Interval/Ratio Data)


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test In a set of k independent samples (where k ≥ 2), do at least
two of the samples represent populations with different mean values? 


Relevant background information on test The term analysis of variance (for which the 
acronym ANOVA is often employed) describes a group of inferential statistical procedures 
developed by the British statistician Sir Ronald Fisher. Analysis of variance procedures are 
employed to evaluate whether or not there is a difference between at least two means in a set of 
data for which two or more means can be computed. The test statistic computed for an analysis 
of variance is based on the F distribution (which is named after Fisher), which is a continuous 
theoretical probability distribution. A computed F value (commonly referred to as an F ratio) 
will always fall within the range 0 ≤ F ≤ ∞. As is the case with the t and chi-square distributions
discussed earlier in the book, there are an infinite number of F distributions — each distribution 
being a function of the number of degrees of freedom employed in the analysis (with degrees of 
freedom being a function of both the number of samples and the number of subjects per sample). 
A more thorough discussion of the F distribution can be found in Section V. 

The single-factor between-subjects analysis of variance is the most basic of the analysis 
of variance procedures.¹ It is employed in a hypothesis testing situation involving k independent
samples. In contrast to the t test for two independent samples (Test 11), which only allows for
a comparison between the means of two independent samples, the single-factor between-sub- 
jects analysis of variance allows for a comparison of two or more independent samples. The 
single-factor between-subjects analysis of variance is also referred to as the completely 
randomized single-factor analysis of variance, the simple analysis of variance, the one-way 
analysis of variance, and the single-factor analysis of variance. 

In conducting the single-factor between-subjects analysis of variance, each of the k 
sample means is employed to estimate the value of the mean of the population the sample rep- 
resents. If the computed test statistic is significant, it indicates there is a significant difference 
between at least two of the sample means in the set of k means. As a result of the latter, the 
researcher can conclude there is a high likelihood that at least two of the samples represent 
populations with different mean values. 

In order to compute the test statistic for the single-factor between-subjects analysis of 
variance, the total variability in the data is divided into between-groups variability and 
within-groups variability. Between-groups variability (which is also referred to as treatment 
variability) is essentially a measure of the variance of the means of the k samples. Within- 
groups variability (which is essentially an average of the variance within each of the k samples) 
is variability that is attributable to chance factors that are beyond the control of a researcher. 
Since such chance factors are often referred to as experimental error, within-groups variability
is also referred to as error or residual variability. The F ratio, which is the test statistic for the 
single-factor between-subjects analysis of variance, is obtained by dividing between-groups 
variability by within-groups variability. Since within-groups variability is employed as a 
baseline measure of the variability in a set of data that is beyond a researcher's control, it is 
assumed that if the k samples are derived from a population with the same mean value, the 
amount of variability between the sample means (i.e., between-groups variability) will be 
approximately the same value as the amount of variability within any single sample (i.e., within- 
groups variability). If, on the other hand, between-groups variability is significantly larger 
than within-groups variability (in which case the value of the F ratio will be larger than 1), it 
is likely that something in addition to chance factors is contributing to the amount of variability 
between the sample means. In such a case, it is assumed that whatever it is that differentiates the 
groups from one another (i.e., the independent variable/experimental treatments) accounts for the 
fact that between-groups variability is larger than within-groups variability. A thorough
discussion of the logic underlying the single-factor between-subjects analysis of variance can 
be found in Section VII. 

The single-factor between-subjects analysis of variance is employed with interval/ratio 
data and is based on the following assumptions: a) Each sample has been randomly selected 
from the population it represents; b) The distribution of data in the underlying population from 
which each of the samples is derived is normal; and c) The third assumption, which is referred 
to as the homogeneity of variance assumption, states that the variances of the k underlying 
populations represented by the k samples are equal to one another. The homogeneity of variance 
assumption is discussed in detail in Section VI. If any of the aforementioned assumptions of the
single-factor between-subjects analysis of variance are saliently violated, the reliability of the 
computed test statistic may be compromised. 


II. Example 


Example 21.1 A psychologist conducts a study to determine whether or not noise can inhibit 
learning. Each of 15 subjects is randomly assigned to one of three groups. Each subject is given 
20 minutes to memorize a list of 10 nonsense syllables, which she is told she will be tested on the 
following day. The five subjects assigned to Group 1, the no noise condition, study the list of 
nonsense syllables while they are in a quiet room. The five subjects assigned to Group 2, the 
moderate noise condition, study the list of nonsense syllables while listening to classical music. 
The five subjects assigned to Group 3, the extreme noise condition, study the list of nonsense 
syllables while listening to rock music. The number of nonsense syllables correctly recalled by 
the 15 subjects follows: Group 1: 8, 10, 9, 10, 9; Group 2: 7, 8, 5, 8, 5; Group 3: 4, 8, 7, 5, 7.
Do the data indicate that noise influenced subjects' performance?


III. Null versus Alternative Hypotheses


Null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$

(The mean of the population Group 1 represents equals the mean of the population Group 2 
represents equals the mean of the population Group 3 represents.) 

Alternative hypothesis $H_1$: Not $H_0$


(This indicates there is a difference between at least two of the k = 3 population means. It
is important to note that the alternative hypothesis should not be written as follows:
$H_1: \mu_1 \neq \mu_2 \neq \mu_3$. The latter notation for the alternative hypothesis is incorrect
because it implies that all three population means must differ from one another in order
to reject the null hypothesis. In this book it will be assumed (unless stated otherwise) that the
alternative hypothesis for the analysis of variance is stated nondirectionally. In order to reject
the null hypothesis, the obtained F value must be equal to or greater than the tabled critical F
value at the prespecified level of significance.)


IV. Test Computations 


The test statistic for the single-factor between-subjects analysis of variance can be computed 
with either computational or definitional equations. Although definitional equations reveal 
the underlying logic behind the analysis of variance, they involve considerably more calculations 
than the computational equations. Because of the latter, computational equations will be 
employed in this section to demonstrate the computation of the test statistic. The definitional 
equations for the single-factor between-subjects analysis of variance are described in Section 
VII. 

The data for Example 21.1 are summarized in Table 21.1. The scores of the $n_1 = 5$
subjects in Group 1 are listed in the column labelled $X_1$, the scores of the $n_2 = 5$ subjects in
Group 2 are listed in the column labelled $X_2$, and the scores of the $n_3 = 5$ subjects in Group 3
are listed in the column labelled $X_3$. Since there are an equal number of subjects in each group,
the notation n is employed to represent the number of subjects per group. In other words,
$n = n_1 = n_2 = n_3$. The columns labelled $X_1^2$, $X_2^2$, and $X_3^2$ list the squares of the scores
of the subjects in each of the three groups.


Table 21.1 Data for Example 21.1 


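        Group 1                   Group 2                   Group 3
   $X_1$      $X_1^2$        $X_2$      $X_2^2$        $X_3$      $X_3^2$
     8          64             7          49             4          16
    10         100             8          64             8          64
     9          81             5          25             7          49
    10         100             8          64             5          25
     9          81             5          25             7          49

$\Sigma X_1 = 46$   $\Sigma X_1^2 = 426$   $\Sigma X_2 = 33$   $\Sigma X_2^2 = 227$   $\Sigma X_3 = 31$   $\Sigma X_3^2 = 203$

$\bar{X}_1 = \Sigma X_1 / n_1 = 46/5 = 9.2$   $\bar{X}_2 = \Sigma X_2 / n_2 = 33/5 = 6.6$   $\bar{X}_3 = \Sigma X_3 / n_3 = 31/5 = 6.2$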


The notation N represents the total number of subjects employed in the experiment. Thus:

$$N = n_1 + n_2 + \cdots + n_k$$

Since there are k = 3 groups:

$$N = n_1 + n_2 + n_3 = 5 + 5 + 5 = 15$$


The value $\Sigma X_T$ represents the total sum of the scores of the N subjects who participate in
the experiment. Thus:

$$\Sigma X_T = \Sigma X_1 + \Sigma X_2 + \cdots + \Sigma X_k$$

Since there are k = 3 groups, $\Sigma X_T = 110$.

$$\Sigma X_T = \Sigma X_1 + \Sigma X_2 + \Sigma X_3 = 46 + 33 + 31 = 110$$

$\bar{X}_T$ represents the grand mean, where $\bar{X}_T = \Sigma X_T / N$. Thus, $\bar{X}_T = 110/15 = 7.33$.
Although $\bar{X}_T$ is not employed in the computational equations to be described in this section,
it is employed in some of the definitional equations described in Section VII.

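The value $\Sigma X_T^2$ represents the total sum of the squared scores of the N subjects who
participate in the experiment. Thus:

$$\Sigma X_T^2 = \Sigma X_1^2 + \Sigma X_2^2 + \cdots + \Sigma X_k^2$$

Since there are k = 3 groups, $\Sigma X_T^2 = 856$.

$$\Sigma X_T^2 = \Sigma X_1^2 + \Sigma X_2^2 + \Sigma X_3^2 = 426 + 227 + 203 = 856$$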
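
The summary values above can be reproduced directly from the raw scores. The following is a minimal Python sketch, not part of the original text; the variable and function names (`groups`, `summary_sums`) are illustrative assumptions.

```python
# Raw recall scores from Example 21.1 (one list per group).
groups = {
    "Group 1 (no noise)": [8, 10, 9, 10, 9],
    "Group 2 (moderate noise)": [7, 8, 5, 8, 5],
    "Group 3 (extreme noise)": [4, 8, 7, 5, 7],
}


def summary_sums(groups):
    """Return N, the total sum of scores, and the total sum of squared scores."""
    n_total = sum(len(scores) for scores in groups.values())
    sum_x = sum(sum(scores) for scores in groups.values())
    sum_x_sq = sum(x * x for scores in groups.values() for x in scores)
    return n_total, sum_x, sum_x_sq


n_total, sum_x, sum_x_sq = summary_sums(groups)
grand_mean = sum_x / n_total  # grand mean = (total sum of scores) / N
print(n_total, sum_x, sum_x_sq, round(grand_mean, 2))  # 15 110 856 7.33
```

The printed values agree with N = 15, the total sum of 110, the total sum of squares of 856, and the grand mean of 7.33 reported in the text.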


Although the group means are not required for computing the analysis of variance test 
statistic, it is recommended that they be computed since visual inspection of the group means can
provide the researcher with a general idea of whether or not it is reasonable to expect a 
significant result. To be more specific, if two or more of the group means are far removed from 
one another, it is likely that the analysis of variance will be significant (especially if the number 
of subjects in each group is reasonably large). Another reason for computing the group means 
is that they are required for comparing individual groups with one another, something that is 
often done following the analysis of variance on the full set of data. The latter types of 
comparisons are described in Section VI. 

As noted in Section I, in order to compute the test statistic for the single-factor
between-subjects analysis of variance, the total variability in the data is divided into between-groups
variability and within-groups variability. In order to do this, the following values are computed:
a) The total sum of squares, which is represented by the notation $SS_T$; b) The between-groups
sum of squares, which is represented by the notation $SS_{BG}$. The between-groups sum
of squares is the numerator of the equation that represents between-groups variability (i.e., the
equation that represents the amount of variability between the means of the k groups); and c) The
within-groups sum of squares, which is represented by the notation $SS_{WG}$. The within-groups
sum of squares is the numerator of the equation that represents within-groups variability (i.e.,
the equation that represents the average amount of variability within each of the k groups, which,
as noted earlier, represents error variability).


Equation 21.1 describes the relationship between $SS_T$, $SS_{BG}$, and $SS_{WG}$.

$$SS_T = SS_{BG} + SS_{WG}$$ (Equation 21.1)

Equation 21.2 is employed to compute $SS_T$.

$$SS_T = \Sigma X_T^2 - \frac{(\Sigma X_T)^2}{N}$$ (Equation 21.2)

Employing Equation 21.2, the value $SS_T = 49.33$ is computed.

$$SS_T = 856 - \frac{(110)^2}{15} = 856 - 806.67 = 49.33$$
Equation 21.3 is employed to compute $SS_{BG}$. In Equation 21.3, the notation $n_j$ and $\Sigma X_j$,
respectively, represent the values of n and $\Sigma X$ for the jth group/sample.

$$SS_{BG} = \sum_{j=1}^{k} \left[ \frac{(\Sigma X_j)^2}{n_j} \right] - \frac{(\Sigma X_T)^2}{N}$$ (Equation 21.3)

The notation $\sum_{j=1}^{k} [(\Sigma X_j)^2 / n_j]$ in Equation 21.3 indicates that for each group the value
$(\Sigma X_j)^2 / n_j$ is computed, and the latter values are summed for all k groups. When there are an
equal number of subjects in each group (as is the case in Example 21.1), the notation n can be
employed in Equation 21.3 in place of $n_j$.

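With reference to Example 21.1, Equation 21.3 can be rewritten as follows:

$$SS_{BG} = \left[ \frac{(\Sigma X_1)^2}{n_1} + \frac{(\Sigma X_2)^2}{n_2} + \frac{(\Sigma X_3)^2}{n_3} \right] - \frac{(\Sigma X_T)^2}{N}$$

Substituting the appropriate values from Example 21.1 in Equation 21.3, the value
$SS_{BG} = 26.53$ is computed.

$$SS_{BG} = \left[ \frac{(46)^2}{5} + \frac{(33)^2}{5} + \frac{(31)^2}{5} \right] - \frac{(110)^2}{15} = 833.2 - 806.67 = 26.53$$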


By algebraically transposing the terms in Equation 21.1, the value of $SS_{WG}$ can be computed
with Equation 21.4.

$$SS_{WG} = SS_T - SS_{BG}$$ (Equation 21.4)

Employing Equation 21.4, the value $SS_{WG} = 22.80$ is computed.

$$SS_{WG} = 49.33 - 26.53 = 22.80$$
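
The three sums of squares can be checked with a short Python sketch of the computational equations. This is an illustrative aid rather than part of the original text; the function name `sums_of_squares` and the list-of-lists data layout are assumptions.

```python
def sums_of_squares(groups):
    """Apply computational Equations 21.2-21.4 to a list of score lists."""
    all_scores = [x for g in groups for x in g]
    n_total = len(all_scores)
    # Shared term: (sum of all scores)^2 / N
    correction = sum(all_scores) ** 2 / n_total
    ss_total = sum(x * x for x in all_scores) - correction                  # Equation 21.2
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups) - correction     # Equation 21.3
    ss_within = ss_total - ss_between                                       # Equation 21.4
    return ss_total, ss_between, ss_within


example_21_1 = [[8, 10, 9, 10, 9], [7, 8, 5, 8, 5], [4, 8, 7, 5, 7]]
ss_t, ss_bg, ss_wg = sums_of_squares(example_21_1)
print(round(ss_t, 2), round(ss_bg, 2), round(ss_wg, 2))  # 49.33 26.53 22.8
```

Because the sketch uses `len(g)` for each group, it also covers the unequal-n case described for Equation 21.3.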


Since the value obtained with Equation 21.4 is a function of the values obtained with
Equations 21.2 and 21.3, if the computations for either of the latter two equations are incorrect,
Equation 21.4 will not yield the correct value for $SS_{WG}$. For this reason, one may prefer to
compute the value of $SS_{WG}$ with Equation 21.5.

$$SS_{WG} = \sum_{j=1}^{k} \left[ \Sigma X_j^2 - \frac{(\Sigma X_j)^2}{n_j} \right]$$ (Equation 21.5)

The summation sign $\sum_{j=1}^{k}$ in Equation 21.5 indicates that for each group the value
$\Sigma X_j^2 - [(\Sigma X_j)^2 / n_j]$ is computed, and the latter values are summed for all k groups.
With reference to Example 21.1, Equation 21.5 can be written as follows:

$$SS_{WG} = \left[ \Sigma X_1^2 - \frac{(\Sigma X_1)^2}{n_1} \right] + \left[ \Sigma X_2^2 - \frac{(\Sigma X_2)^2}{n_2} \right] + \left[ \Sigma X_3^2 - \frac{(\Sigma X_3)^2}{n_3} \right]$$

© 2000 by Chapman & Hall/CRC 


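Employing Equation 21.5, the value $SS_{WG} = 22.80$ is computed, which is the same
value computed with Equation 21.4.

$$SS_{WG} = \left[ 426 - \frac{(46)^2}{5} \right] + \left[ 227 - \frac{(33)^2}{5} \right] + \left[ 203 - \frac{(31)^2}{5} \right] = 2.8 + 9.2 + 10.8 = 22.80$$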
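
As a cross-check, the within-groups sum of squares can be computed group by group, without relying on the total and between-groups values. The sketch below is an assumption-labeled illustration (the function name `ss_within_direct` is hypothetical), following the form of Equation 21.5.

```python
def ss_within_direct(groups):
    """Equation 21.5: sum, over groups, of (sum of squared scores) - (sum of scores)^2 / n_j."""
    return sum(sum(x * x for x in g) - sum(g) ** 2 / len(g) for g in groups)


example_21_1 = [[8, 10, 9, 10, 9], [7, 8, 5, 8, 5], [4, 8, 7, 5, 7]]
# Per-group contributions: 2.8, 9.2, and 10.8
per_group = [ss_within_direct([g]) for g in example_21_1]
ss_wg_direct = ss_within_direct(example_21_1)
print([round(v, 1) for v in per_group], round(ss_wg_direct, 2))  # [2.8, 9.2, 10.8] 22.8
```

Agreement between this value and the one obtained by subtraction is a useful guard against arithmetic slips in either route.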


The reader should take note of the fact that the values $SS_T$, $SS_{BG}$, and $SS_{WG}$ can never
be negative numbers. If a negative value is obtained for any of the aforementioned values, it
indicates a computational error has been made.

At this point the values of the between-groups variance and the within-groups variance
can be computed. In the single-factor between-subjects analysis of variance, the between-groups
variance is referred to as the mean square between-groups, which is represented by
the notation $MS_{BG}$. $MS_{BG}$ is computed with Equation 21.6.

$$MS_{BG} = \frac{SS_{BG}}{df_{BG}}$$ (Equation 21.6)

The within-groups variance is referred to as the mean square within-groups, which is
represented by the notation $MS_{WG}$. $MS_{WG}$ is computed with Equation 21.7.

$$MS_{WG} = \frac{SS_{WG}}{df_{WG}}$$ (Equation 21.7)

Note that a total mean square is not computed.

In order to compute $MS_{BG}$ and $MS_{WG}$, it is required that the values $df_{BG}$ and $df_{WG}$ (the
denominators of Equations 21.6 and 21.7) be computed. $df_{BG}$, which represents the between-groups
degrees of freedom, is computed with Equation 21.8.

$$df_{BG} = k - 1$$ (Equation 21.8)

$df_{WG}$, which represents the within-groups degrees of freedom, is computed with
Equation 21.9.

$$df_{WG} = N - k$$ (Equation 21.9)

Although it is not required in order to determine the F ratio, the total degrees of freedom
is generally computed, since it can be used to confirm the df values computed with Equations
21.8 and 21.9, and since it is employed in the analysis of variance summary table.
The total degrees of freedom (represented by the notation $df_T$) is computed with Equation
21.10.

$$df_T = N - 1$$ (Equation 21.10)

The relationship between $df_{BG}$, $df_{WG}$, and $df_T$ is described by Equation 21.11.

$$df_T = df_{BG} + df_{WG}$$ (Equation 21.11)

Employing Equations 21.8-21.10, the values $df_{BG} = 2$, $df_{WG} = 12$, and $df_T = 14$ are
computed. Note that $df_T = df_{BG} + df_{WG} = 2 + 12 = 14$.

$$df_{BG} = 3 - 1 = 2 \qquad df_{WG} = 15 - 3 = 12 \qquad df_T = 15 - 1 = 14$$
Employing Equations 21.6 and 21.7, the values $MS_{BG} = 13.27$ and $MS_{WG} = 1.9$ are
computed.

$$MS_{BG} = \frac{26.53}{2} = 13.27 \qquad MS_{WG} = \frac{22.8}{12} = 1.9$$


The F ratio, which is the test statistic for the single-factor between-subjects analysis of
variance, is computed with Equation 21.12.

$$F = \frac{MS_{BG}}{MS_{WG}}$$ (Equation 21.12)

Employing Equation 21.12, the value F = 6.98 is computed.

$$F = \frac{13.27}{1.9} = 6.98$$
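
The full sequence of computations, from sums of squares through the F ratio, can be collected in one short routine. The following Python sketch is offered as an illustration under stated assumptions: the function name `one_way_anova` and the dictionary keys are hypothetical, and the routine does not look up critical values or a probability for F.

```python
def one_way_anova(groups):
    """Single-factor between-subjects ANOVA via computational Equations 21.2-21.12."""
    all_scores = [x for g in groups for x in g]
    n_total, k = len(all_scores), len(groups)
    correction = sum(all_scores) ** 2 / n_total
    ss_total = sum(x * x for x in all_scores) - correction              # Equation 21.2
    ss_bg = sum(sum(g) ** 2 / len(g) for g in groups) - correction      # Equation 21.3
    ss_wg = ss_total - ss_bg                                            # Equation 21.4
    df_bg, df_wg = k - 1, n_total - k                                   # Equations 21.8, 21.9
    ms_bg, ms_wg = ss_bg / df_bg, ss_wg / df_wg                         # Equations 21.6, 21.7
    if ms_wg == 0:
        # No within-groups variability: the F ratio cannot be computed.
        raise ZeroDivisionError("MS_WG is zero, so F is undefined")
    return {
        "SS_BG": ss_bg, "SS_WG": ss_wg, "SS_T": ss_total,
        "df_BG": df_bg, "df_WG": df_wg, "df_T": n_total - 1,
        "MS_BG": ms_bg, "MS_WG": ms_wg,
        "F": ms_bg / ms_wg,                                             # Equation 21.12
    }


result = one_way_anova([[8, 10, 9, 10, 9], [7, 8, 5, 8, 5], [4, 8, 7, 5, 7]])
print(round(result["MS_BG"], 2), round(result["MS_WG"], 2), round(result["F"], 2))  # 13.27 1.9 6.98
```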


The reader should take note of the fact that the values $MS_{BG}$, $MS_{WG}$, and F can never
be negative numbers. If a negative value is obtained for any of the aforementioned values, it
indicates a computational error has been made. If $MS_{WG} = 0$, Equation 21.12 will be
insoluble. The only time $MS_{WG} = 0$ is when within each group all subjects obtain the same score
(i.e., there is no within-groups variability). If all of the groups have the identical mean value,
$MS_{BG} = 0$, and if the latter is true, F = 0.


V. Interpretation of the Test Results


It is common practice to summarize the results of a single-factor between-subjects analysis of
variance with the summary table represented by Table 21.2. 


Table 21.2. Summary Table of Analysis of Variance for Example 21.1 


Source of variation      SS       df      MS       F
Between-groups          26.53      2     13.27    6.98
Within-groups           22.80     12      1.90
Total                   49.33     14


The obtained value F = 6.98 is evaluated with Table A10 (Table of the F Distribution)
in the Appendix. In Table A10 critical values are listed in reference to the number of degrees
of freedom associated with the numerator and the denominator of the F ratio (i.e., $df_{num}$ and
$df_{den}$). In employing the F distribution in reference to Example 21.1, the degrees of freedom
for the numerator are $df_{BG} = 2$ and the degrees of freedom for the denominator are $df_{WG} = 12$.
In Table A10 the tabled $F_{.95}$ and $F_{.99}$ values are, respectively, employed to evaluate the
nondirectional alternative hypothesis $H_1$: Not $H_0$ at the .05 and .01 levels. Throughout the
discussion of the analysis of variance the notation $F_{.05}$ is employed to represent the tabled critical
F value at the .05 level. The latter value corresponds to the relevant tabled $F_{.95}$ value in Table
A10. In the same respect, the notation $F_{.01}$ is employed to represent the tabled critical F value
at the .01 level, and corresponds to the relevant tabled $F_{.99}$ value in Table A10.

For $df_{num} = 2$ and $df_{den} = 12$, the tabled $F_{.95}$ and $F_{.99}$ values are $F_{.95} = 3.89$ and
$F_{.99} = 6.93$. Thus, $F_{.05} = 3.89$ and $F_{.01} = 6.93$. In order to reject the null hypothesis, the
obtained F value must be equal to or greater than the tabled critical value at the prespecified level
of significance. Since F = 6.98 is greater than $F_{.05} = 3.89$ and $F_{.01} = 6.93$, the alternative
hypothesis is supported at both the .05 and .01 levels.

A summary of the analysis of Example 21.1 with the single-factor between-subjects 
analysis of variance follows: It can be concluded that there is a significant difference between 
at least two of the three groups exposed to different levels of noise. This result can be sum- 
marized as follows: F(2, 12) = 6.98, p < .01. 


VI. Additional Analytical Procedures for the Single-Factor Between- 
Subjects Analysis of Variance and/or Related Tests 


1. Comparisons following computation of the omnibus F value for the single-factor 
between-subjects analysis of variance The F value computed with the analysis of variance is 
commonly referred to as the omnibus F value. The latter term implies that the obtained F value 
is based on an evaluation of all k group means. Recollect that in order to reject the null hy- 
pothesis, it is only required that at least two of the k group means differ significantly from one 
another. As a result of this, the omnibus F value does not indicate whether just two or, in fact, 
more than two groups have mean values that differ significantly from one another. In order to 
answer this question it is necessary to conduct additional tests, which are referred to as com- 
parisons (since they involve comparing the means of two or more groups with one another). 

Researchers are not in total agreement with respect to the appropriate protocol for
conducting comparisons. The basis for the disagreement revolves around the fact that each
comparison one conducts increases the likelihood of committing at least one Type I error within
a set of comparisons. For this reason, it can be argued that a researcher should employ a lower
Type I error rate per comparison to ensure that the overall likelihood of committing at least one
Type I error in the set of comparisons does not exceed a prespecified alpha value that is
reasonably low (e.g., $\alpha = .05$). At this point in the discussion the following two terms are
defined: a) The familywise Type I error rate (represented by the notation $\alpha_{FW}$) is the
likelihood that there will be at least one Type I error in a set of c comparisons; and b) The per
comparison Type I error rate (represented by the notation $\alpha_{PC}$) is the likelihood that any single
comparison will result in a Type I error.

Equation 21.13 defines the relationship between the familywise Type I error rate and the
per comparison Type I error rate, where c = the number of comparisons.

$$\alpha_{FW} = 1 - (1 - \alpha_{PC})^c$$ (Equation 21.13)


Let us assume that upon computing the value F = 6.98 for Example 21.1, the researcher
decides to compare each of the three group means with one another, i.e., $\bar{X}_1$ versus $\bar{X}_2$;
$\bar{X}_1$ versus $\bar{X}_3$; $\bar{X}_2$ versus $\bar{X}_3$. The three aforementioned comparisons can be
conceptualized as a family/set of comparisons, with c = 3. If for each of the comparisons the
researcher establishes the value $\alpha_{PC} = .05$, employing Equation 21.13 it can be determined that
the familywise Type I error rate will equal $\alpha_{FW} = .14$. This result tells the researcher that the
likelihood of committing at least one Type I error in the set of three comparisons is .14.

$$\alpha_{FW} = 1 - (1 - .05)^3 = .14$$

Equation 21.14, which is computationally more efficient than Equation 21.13, can be
employed to provide an approximation of $\alpha_{FW}$.

$$\alpha_{FW} = (c)(\alpha_{PC})$$ (Equation 21.14)

Employing Equation 21.14, the value $\alpha_{FW} = .15$ is computed for Example 21.1.

$$\alpha_{FW} = (3)(.05) = .15$$
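
The exact and approximate familywise rates are easy to compare numerically. The sketch below is an illustration only; the function names `familywise_exact` and `familywise_approx` are assumptions introduced here.

```python
def familywise_exact(alpha_pc, c):
    """Equation 21.13: probability of at least one Type I error in c comparisons."""
    return 1 - (1 - alpha_pc) ** c


def familywise_approx(alpha_pc, c):
    """Equation 21.14: simple product approximation of the familywise rate."""
    return c * alpha_pc


exact = familywise_exact(0.05, 3)
approx = familywise_approx(0.05, 3)
print(round(exact, 2), round(approx, 2))  # 0.14 0.15
```

The approximation always equals or exceeds the exact value, and the gap widens as c grows (for instance, with c = 20 and a per comparison rate of .05 the product reaches 1.0, whereas the exact rate remains below 1).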


Note that the familywise Type I error rate $\alpha_{FW} = .14$ is almost three times the value of
the per comparison Type I error rate $\alpha_{PC} = .05$. Of greater importance is the fact that the
value $\alpha_{FW} = .14$ is considerably higher than .05, the usual maximum value permitted for a Type
I error rate in hypothesis testing. Thus, some researchers would consider the value $\alpha_{FW} = .14$
to be excessive, and if, in fact, a maximum familywise Type I error rate of $\alpha_{FW} = .05$ is
stipulated by a researcher, it is required that the Type I error rate for each comparison (i.e.,
$\alpha_{PC}$) be reduced. Through use of Equation 21.15 or Equation 21.16 (which are, respectively, the
algebraic transpositions of Equation 21.13 and Equation 21.14), it can be determined that in order
to have $\alpha_{FW} = .05$, the value of $\alpha_{PC}$ must equal .017.


$$\alpha_{PC} = 1 - \sqrt[c]{1 - \alpha_{FW}} = 1 - \sqrt[3]{1 - .05} = .017$$ (Equation 21.15)


$$\alpha_{PC} = \frac{\alpha_{FW}}{c} = \frac{.05}{3} = .0167$$ (Equation 21.16)
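
The two transposed equations can be sketched in the same way. Again, this is an illustrative aid with hypothetical function names (`per_comparison_exact`, `per_comparison_approx`), not part of the original text.

```python
def per_comparison_exact(alpha_fw, c):
    """Equation 21.15: per comparison rate that yields a familywise rate of alpha_fw."""
    return 1 - (1 - alpha_fw) ** (1 / c)


def per_comparison_approx(alpha_fw, c):
    """Equation 21.16: familywise rate divided equally among c comparisons."""
    return alpha_fw / c


pc_exact = per_comparison_exact(0.05, 3)
pc_approx = per_comparison_approx(0.05, 3)
print(round(pc_exact, 3), round(pc_approx, 4))  # 0.017 0.0167
```

Substituting the exact per comparison value back into Equation 21.13 recovers the stipulated familywise rate of .05, which is a convenient check on the algebra.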


The reader should take note of the fact that although a reduction in the value of $\alpha_{FW}$
reduces the likelihood of committing a Type I error, it increases the likelihood of committing a
Type II error (i.e., not rejecting a false null hypothesis). Thus, as one reduces the value of
$\alpha_{FW}$, the power associated with each of the comparisons that is conducted is reduced. In view
of this, it should be apparent that if a researcher elects to adjust the value of $\alpha_{FW}$, he must
consider the impact it will have on the Type I versus Type II error rates for all of the comparisons
that are conducted within the set of comparisons.

A number of different strategies have been developed with regard to what a researcher 
should do about adjusting the familywise Type I error rate. These strategies are employed 
within the framework of the following two types of comparisons that can be conducted following 
an omnibus F test: planned comparisons versus unplanned comparisons. The distinction 
between planned and unplanned comparisons follows. 

Planned comparisons (also known as a priori comparisons) Planned comparisons are 
comparisons a researcher plans prior to collecting the data for a study. In a well designed exper- 
iment one would expect that a researcher will probably predict differences between specific 
groups prior to conducting the study. As a result of this, there is general agreement that 
following the computation of an omnibus F value, a researcher is justified in conducting any 
comparisons which have been planned beforehand, regardless of whether or not the omnibus F 
value is significant. Although most sources state that when planned comparisons are conducted 
it is not necessary to adjust the familywise Type I error rate, under certain conditions (such as 
when there are a large number of planned comparisons) an argument can be made for adjusting 
the value of $\alpha_{FW}$.

In actuality, there are two types of planned comparisons that can be conducted, which are
referred to as simple comparisons versus complex comparisons. A simple comparison is any
comparison in which two groups are compared with one another. For instance: Group 1 versus
Group 2 (i.e., $\bar{X}_1$ versus $\bar{X}_2$), which allows one to evaluate the null hypothesis
$H_0: \mu_1 = \mu_2$. Simple comparisons are often referred to as pairwise comparisons. A complex
comparison is any comparison in which the combined performance of two or more groups is compared
with the performance of one of the other groups or the combined performance of two or more of the
other groups. For instance: Group 1 versus the average of Groups 2 and 3 (i.e., $\bar{X}_1$ versus
$(\bar{X}_2 + \bar{X}_3)/2$, which evaluates the null hypothesis $H_0: \mu_1 = (\mu_2 + \mu_3)/2$). If there
are four groups, one can conduct a complex comparison involving the average of Groups 1 and 2 versus
the average of Groups 3 and 4 (i.e., $(\bar{X}_1 + \bar{X}_2)/2$ versus $(\bar{X}_3 + \bar{X}_4)/2$, which
evaluates the null hypothesis $H_0: (\mu_1 + \mu_2)/2 = (\mu_3 + \mu_4)/2$).

It should be noted that if the omnibus F value is significant, it indicates there is at least one 
significant difference among all of the possible comparisons that can be conducted. Kirk (1982, 
1995) and Maxwell and Delaney (1990), among others, note that in such a situation it is theo- 
retically possible that none of the simple comparisons are significant, and that the one (or perhaps 
more than one) significant comparison is a complex comparison. It is important to note that 
regardless of what type of comparisons a researcher conducts, all comparisons should be meaningful
within the context of the problem under study, and, as a general rule, comparisons should
not be redundant with respect to one another.

Unplanned comparisons (also known as post hoc, multiple, or a posteriori comparisons)
An unplanned comparison (which can be either a simple or complex comparison) is a comparison
a researcher decides to conduct after collecting the data for a study. In conducting unplanned
comparisons, following the data collection phase of a study a researcher examines the
values of the k group means, and at that point decides which groups to compare with one another.
Although for many years most researchers argued that unplanned comparisons should not be
conducted unless the omnibus F value is significant, more recently many researchers (including
this author) have adopted the viewpoint that it is acceptable to conduct unplanned comparisons
regardless of whether or not a significant F value is obtained. Although there is general agreement
among researchers that the familywise Type I error rate should be adjusted when
unplanned comparisons are conducted, there is a lack of consensus with regard to the degree of
adjustment that is required. This is reflected in the fact that a variety of unplanned comparison
procedures have been developed, each of which employs a different method for adjusting the
value of $\alpha_{FW}$. More often than not, when unplanned comparisons are conducted, a researcher
will compare each of the k groups with all of the other (k - 1) groups (i.e., all possible
comparisons between pairs of groups are made). This "shotgun" approach, which maximizes the
number of comparisons conducted, represents the classic situation for which most sources argue
it is imperative to control the value of $\alpha_{FW}$.

The rationale behind the argument that it is more important to adjust the value of $\alpha_{FW}$ in
the case of unplanned comparisons as opposed to planned comparisons will be illustrated with
a simple example. Let us assume that a set of data is evaluated with an analysis of variance, and
a significant omnibus F value is obtained. Let us also assume that within the whole set of data
it is possible to conduct 20 comparisons between pairs of means and/or combinations of means.
If the truth were known, however, no differences exist between the means of any of the populations
being compared. However, in spite of the fact that none of the population means differ,
within the set of 20 possible comparisons, one comparison (specifically, the one involving
$\bar{X}_1$ versus $\bar{X}_2$) results in a significant difference at the .05 level. In this example we
will assume that the difference $\bar{X}_1 - \bar{X}_2$ is the largest difference between any pair of means
or combination of means in the set of 20 possible comparisons. If, in fact, $\mu_1 = \mu_2$, a significant
result obtained for the comparison $\bar{X}_1$ versus $\bar{X}_2$ will represent a Type I error.

Let us assume that the comparison $\bar{X}_1$ versus $\bar{X}_2$ is planned beforehand, and it is the only
comparison the researcher intends to make. Since there are 20 possible comparisons, the
researcher has only a 1 in 20 chance (i.e., .05) of conducting the comparison Group 1 versus
Group 2, and in the process committing a Type I error. If, on the other hand, the researcher does
not plan any comparisons beforehand, but after computing the omnibus F value decides to make
all 20 possible comparisons, he has a 100% chance of making a Type I error, since it is certain
he will compare Groups 1 and 2. Even if the researcher decides to make only one unplanned
comparison (specifically, the one involving the largest difference between any pair of means
or combination of means), he will also have a 100% chance of committing a Type I error, since
the comparison he will make will be $\bar{X}_1$ versus $\bar{X}_2$. This example illustrates that from a
probabilistic viewpoint, the familywise Type I error rate associated with a set of unplanned
comparisons will be higher than the rate for a set of planned comparisons.
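
The argument above can be illustrated with a small simulation. The following Python sketch is not from the original text and rests on several assumptions that are stated here rather than drawn from the source: three groups of five scores are sampled from the same normal population (so every null hypothesis is true), each pairwise comparison is evaluated with a t statistic that uses the pooled within-groups mean square as its error term, and the two-tailed .05 critical value for 12 degrees of freedom is taken as 2.179 from a standard t table. The function name `simulate`, the seed, and the number of trials are arbitrary choices.

```python
import random
from itertools import combinations
from statistics import mean

T_CRIT_DF12 = 2.179  # two-tailed .05 critical t for df = 12 (valid only for k = 3, n = 5)
K, N_PER_GROUP = 3, 5


def simulate(trials=2000, seed=1):
    """Estimate Type I error rates when all population means are equal."""
    rng = random.Random(seed)
    planned_hits = 0    # the single planned comparison: Group 1 versus Group 2
    unplanned_hits = 0  # the pair chosen after the fact for its largest mean difference
    for _ in range(trials):
        groups = [[rng.gauss(0, 1) for _ in range(N_PER_GROUP)] for _ in range(K)]
        means = [mean(g) for g in groups]
        # Pooled within-groups mean square: SS_WG / (N - k)
        ss_wg = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
        ms_wg = ss_wg / (K * N_PER_GROUP - K)
        se = (2 * ms_wg / N_PER_GROUP) ** 0.5  # standard error of a difference between two means
        planned_t = abs(means[0] - means[1]) / se
        largest_t = max(abs(means[i] - means[j]) for i, j in combinations(range(K), 2)) / se
        planned_hits += planned_t >= T_CRIT_DF12
        unplanned_hits += largest_t >= T_CRIT_DF12
    return planned_hits / trials, unplanned_hits / trials


planned_rate, unplanned_rate = simulate()
print(planned_rate, unplanned_rate)
```

Under these assumptions the planned comparison should reject at a rate near .05, while selecting the largest difference after inspecting the means can only reject as often or more often, since the planned pair is always among the candidates.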

The remainder of this section will describe the most commonly recommended comparison
procedures. Specifically, the following procedures will be described: a) Linear contrasts; b)
Multiple t tests/Fisher's LSD test; c) The Bonferroni-Dunn test; d) Tukey's HSD test; e)
The Newman-Keuls test; f) The Scheffé test; and g) The Dunnett test. Although any of the
aforementioned comparison procedures can be used for both planned and unplanned
comparisons, linear contrasts and multiple t tests/Fisher's LSD test (which do not control the
value of $\alpha_{FW}$) are generally described within the context of planned comparisons. Some sources,
however, do employ the latter procedures for unplanned comparisons. The Bonferroni-Dunn
test, Tukey's HSD test, the Newman-Keuls test, the Scheffé test, and the Dunnett test are
generally described as unplanned comparison procedures that are employed when a researcher
wants to control the value of $\alpha_{FW}$. The point that will be emphasized throughout the discussion
to follow is that the overriding issue in selecting a comparison procedure is whether or not the
researcher wants to control the value of $\alpha_{FW}$, and if so, to what degree. This latter issue is
essentially what determines the difference between the various comparison procedures to be
described in this section.


Linear contrasts Linear contrasts (which are almost always described within the framework
of planned comparisons) are comparisons that involve a linear combination of population
means. In the case of a simple comparison, a linear contrast consists of comparing two of the
group means with one another (e.g., $\bar{X}_1$ versus $\bar{X}_2$). In the case of a complex comparison, the
combined performance of two or more groups is compared with the performance of one of the
other groups or the combined performance of two or more of the other groups (e.g., $\bar{X}_1$ versus
$(\bar{X}_2 + \bar{X}_3)/2$). In conducting both simple and complex comparisons, the researcher must assign
weights, which are referred to as coefficients, to all of the group means. These coefficients
reflect the relative contribution of each of the group means to the two mean values that are being
contrasted with one another in the comparison. In the case of a complex comparison, at least one
of the two mean values that are contrasted in the comparison will be based on a weighted
combination of the means of two or more of the groups.

The use of linear contrasts will be described for both simple and complex comparisons.
In the examples that will be employed to illustrate linear contrasts, it will be assumed that all
comparisons are planned beforehand, and that the researcher is making no attempt to control the
value of $\alpha_{FW}$. Thus, for each comparison to be conducted it will be assumed that $\alpha_{PC} = .05$.
All of the comparisons (both simple and complex) to be described in this section are referred to
as single degree of freedom (df) comparisons. This is the case, since one degree of freedom
is always employed in the numerator of the F ratio (which represents the test statistic for a
comparison).


Linear contrast of a planned simple comparison Let us assume that prior to obtaining the
data for Example 21.1, the experimenter hypothesizes there will be a significant difference
between Group 1 (no noise) and Group 2 (moderate noise). After conducting the omnibus F
test, the simple planned comparison $\bar{X}_1$ versus $\bar{X}_2$ is conducted to compare the performance of
the two groups. The null and alternative hypotheses for the comparison follow: $H_0: \mu_1 = \mu_2$
versus $H_1: \mu_1 \neq \mu_2$. Table 21.3 summarizes the information required to conduct the planned
comparison $\bar{X}_1$ versus $\bar{X}_2$.




Table 21.3 Planned Simple Comparison: Group 1 Versus Group 2


                         Coefficient        Product                            Squared Coefficient
Group     $\bar{X}_j$    ($c_j$)            ($(c_j)(\bar{X}_j)$)               ($c_j^2$)

1         9.2            +1                 (+1)(9.2) = +9.2                   1
2         6.6            -1                 (-1)(6.6) = -6.6                   1
3         6.2             0                 (0)(6.2) = 0                       0

                         $\Sigma c_j = 0$   $\Sigma (c_j)(\bar{X}_j) = 2.6$    $\Sigma c_j^2 = 2$


The following should be noted with respect to Table 21.3: a) The rows of Table 21.3
represent data for each of the three groups employed in the experiment. Even though the
comparison involves two of the three groups, the data for all three groups are included to
illustrate how the group which is not involved in the comparison is eliminated from the
calculations; b) Column 2 contains the mean score of each of the groups; c) In Column 3 each
of the groups is assigned a coefficient, represented by the notation $c_j$. The value of $c_j$ assigned
to each group is a weight that reflects the proportional contribution of the group to the
comparison. Any group not involved in the comparison (in this instance Group 3) is assigned
a coefficient of zero. Thus, $c_3 = 0$. When only two groups are involved in a comparison, one
of the groups (it does not matter which one) is assigned a coefficient of +1 (in this instance
Group 1 is assigned the coefficient $c_1 = +1$) and the other group a coefficient of -1 (i.e.,
$c_2 = -1$). Note that $\Sigma c_j$, the sum of the coefficients (which is the sum of Column 3), must
always equal zero (i.e., $c_1 + c_2 + c_3 = (+1) + (-1) + 0 = 0$); d) In Column 4 a product is
obtained for each group. The product for a group is obtained by multiplying the mean of the
group ($\bar{X}_j$) by the coefficient that has been assigned to that group ($c_j$). Although it may not be
immediately apparent from looking at the table, the sum of Column 4, $\Sigma (c_j)(\bar{X}_j)$, is, in fact, the
difference between the two means being compared (i.e., $\Sigma (c_j)(\bar{X}_j)$ is equal to
$\bar{X}_1 - \bar{X}_2 = 9.2 - 6.6 = 2.6$); and e) $\Sigma c_j^2$, the sum of Column 5, is the sum of the squared
coefficients.


The test statistic for the comparison is an F ratio, represented by the notation $F_{comp}$. In
order to compute the value $F_{comp}$, a sum of squares ($SS_{comp}$), a degrees of freedom value
($df_{comp}$), and a mean square ($MS_{comp}$) for the comparison must be computed. The comparison
sum of squares ($SS_{comp}$) is computed with Equation 21.17. Note that Equation 21.17 assumes
there are an equal number of subjects (n) in each group.





$$SS_{comp} = \frac{n\left[\sum (c_j)(\bar{X}_j)\right]^2}{\sum c_j^2}$$ (Equation 21.17)

Substituting the appropriate values from Example 21.1 in Equation 21.17, the value
$SS_{comp} = 16.9$ is computed.

$$SS_{comp} = \frac{5(2.6)^2}{2} = 16.9$$


The comparison mean square ($MS_{comp}$) is computed with Equation 21.18. $MS_{comp}$
represents a measure of between-groups variability which takes into account just the two group
means involved in the comparison.

$$MS_{comp} = \frac{SS_{comp}}{df_{comp}}$$ (Equation 21.18)







In a single degree of freedom comparison, $df_{comp}$ will always equal 1, since the number
of mean values being compared in such a comparison will always be $k_{comp} = 2$, and
$df_{comp} = k_{comp} - 1 = 2 - 1 = 1$. Substituting the values $SS_{comp} = 16.9$ and $df_{comp} = 1$ in
Equation 21.18, the value $MS_{comp} = 16.9$ is computed. Note that since in a single degree of
freedom comparison the value of $df_{comp}$ will always equal 1, the values $SS_{comp}$ and $MS_{comp}$
will always be equivalent.

$$MS_{comp} = \frac{16.9}{1} = 16.9$$


The test statistic $F_{comp}$ is computed with Equation 21.19. $F_{comp}$ is a ratio that is comprised
of the variability of the two means involved in the comparison divided by the within-groups
variability employed for the omnibus F test.

$$F_{comp} = \frac{MS_{comp}}{MS_{WG}}$$ (Equation 21.19)

Substituting the values $MS_{comp} = 16.9$ and $MS_{WG} = 1.9$ in Equation 21.19, the value
$F_{comp} = 8.89$ is computed.

$$F_{comp} = \frac{16.9}{1.9} = 8.89$$


The value $F_{comp} = 8.89$ is evaluated with Table A10. Employing Table A10, the
appropriate degrees of freedom value for the numerator is $df_{num} = 1$. This is the case, since the
numerator of the $F_{comp}$ ratio is $MS_{comp}$, and the degrees of freedom associated with the latter
value is $df_{comp} = 1$. The denominator degrees of freedom will be the value of $df_{WG}$ employed
for the omnibus F test, which for Example 21.1 is $df_{WG} = 12$.

For $df_{num} = 1$ and $df_{den} = 12$, the tabled critical .05 and .01 F values are $F_{.05} = 4.75$
and $F_{.01} = 9.33$. Since the obtained value $F_{comp} = 8.89$ is greater than $F_{.05} = 4.75$, the
nondirectional alternative hypothesis $H_1: \mu_1 \neq \mu_2$ is supported at the .05 level. Since
$F_{comp} = 8.89$ is less than $F_{.01} = 9.33$, the latter alternative hypothesis is not supported at the
.01 level. Thus, if the value $\alpha = .05$ is employed, the researcher can conclude that Group 1
recalled a significantly greater number of nonsense syllables than Group 2.
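
The computations in Equations 21.17 through 21.19 can be sketched in a few lines of Python. This is a minimal illustration, assuming equal group sizes as Equation 21.17 requires; the function name `linear_contrast_f` and its argument names are hypothetical and are not part of the Handbook.

```python
def linear_contrast_f(means, coefficients, n, ms_wg):
    """Return (SS_comp, MS_comp, F_comp) for a single degree of freedom
    linear contrast, assuming n subjects in every group (Equation 21.17)."""
    if abs(sum(coefficients)) > 1e-12:
        raise ValueError("contrast coefficients must sum to zero")
    # Sum of Column 4 in Table 21.3: sum of (c_j)(mean_j)
    contrast = sum(c * m for c, m in zip(coefficients, means))
    # Equation 21.17
    ss_comp = n * contrast ** 2 / sum(c ** 2 for c in coefficients)
    # Equation 21.18 with df_comp = 1
    ms_comp = ss_comp / 1
    # Equation 21.19
    f_comp = ms_comp / ms_wg
    return ss_comp, ms_comp, f_comp


# Values from Example 21.1: group means, n = 5 per group, MS_WG = 1.9
means = [9.2, 6.6, 6.2]
ss, ms, f = linear_contrast_f(means, [+1, -1, 0], n=5, ms_wg=1.9)
print(round(ss, 2), round(ms, 2), round(f, 2))  # approximately 16.9, 16.9, 8.89
```

The critical values $F_{.05} = 4.75$ and $F_{.01} = 9.33$ still have to be taken from Table A10 (or from statistical software); the sketch covers only the arithmetic of the test statistic.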

With respect to Example 21.1, it is possible to conduct the following additional simple
comparisons: $\bar{X}_1$ versus $\bar{X}_3$; $\bar{X}_2$ versus $\bar{X}_3$. The latter two simple comparisons will be
conducted later employing multiple t tests/Fisher's LSD test (which, as will be noted, are
computationally equivalent to the linear contrast procedure described in this section).


Linear contrast of a planned complex comparison Let us assume that prior to obtaining
the data for Example 21.1, the experimenter hypothesizes there will be a significant difference
between the performance of Group 3 (extreme noise) and the combined performance of Group
1 (no noise) and Group 2 (moderate noise). Such a comparison is a complex comparison,
since it involves a single group being contrasted with two other groups. As is the case with
the simple comparison $\bar{X}_1$ versus $\bar{X}_2$, two means are also contrasted within the framework of the
complex comparison. However, one of the two means is a composite mean that is based upon
the combined performance of two groups. The complex comparison represents a single degree
of freedom comparison. This is the case, since two means are being contrasted with one another:
specifically, the mean of Group 3 with the composite mean of Groups 1 and 2. The fact
that it is a single degree of freedom comparison is also reflected in the fact that there is one
equals sign (=) in both the null and alternative hypotheses for the comparison. The null and
alternative hypotheses for the complex comparison are $H_0: \mu_3 = (\mu_1 + \mu_2)/2$ versus
$H_1: \mu_3 \neq (\mu_1 + \mu_2)/2$. Table 21.4 summarizes the information required to conduct the
planned complex comparison $\bar{X}_3$ versus $(\bar{X}_1 + \bar{X}_2)/2$.


Table 21.4 Planned Complex Comparison: Group 3 Versus Groups 1 and 2 


Group   Mean ($\bar{X}_j$)   Coefficient ($c_j$)   Product $(c_j)(\bar{X}_j)$        Squared Coefficient ($c_j^2$)
1       9.2                  -1/2                  (-1/2)(9.2) = -4.6               1/4
2       6.6                  -1/2                  (-1/2)(6.6) = -3.3               1/4
3       6.2                  +1                    (+1)(6.2) = +6.2                 1
Sums                         $\sum c_j = 0$        $\sum (c_j)(\bar{X}_j) = -1.7$   $\sum c_j^2 = 1.5$


Note that the first two columns of Table 21.4 are identical to the first two columns of Table
21.3. The different values in the remaining columns of Table 21.4 result from the fact that
different coefficients are employed for the complex comparison. The absolute value of
$\sum (c_j)(\bar{X}_j) = -1.7$ represents the difference between the two sets of means contrasted in the null
hypothesis: specifically, the difference between $\bar{X}_3 = 6.2$ and the composite mean of Group
1 ($\bar{X}_1 = 9.2$) and Group 2 ($\bar{X}_2 = 6.6$). The latter composite mean will be represented with
the notation $\bar{X}_{1/2}$. Since $\bar{X}_{1/2} = (9.2 + 6.6)/2 = 7.9$, the difference between the two means
evaluated with the comparison is 6.2 - 7.9 = -1.7 (which is the same as the value
$\sum (c_j)(\bar{X}_j) = -1.7$, which is the sum of Column 4 in Table 21.4).

Before conducting the computations for the complex comparison, a general protocol will 
be described for assigning coefficients to the groups involved in either a simple or complex 
comparison. Within the framework of describing the protocol, it will be employed to determine 
the coefficients for the complex comparison under discussion. 

1) Write out the null hypothesis (i.e., $H_0: \mu_3 = (\mu_1 + \mu_2)/2$). Any group not involved in
the comparison (i.e., not noted in the null hypothesis) is assigned a coefficient of zero. Since all
three groups are included in the present comparison, none of the groups receives a coefficient
of zero.

2) On each side of the equals sign of the null hypothesis write the number of group means
designated in the null hypothesis. Thus:

$H_0$:   $\mu_3$   =   $(\mu_1 + \mu_2)/2$
         1 mean        2 means


3) To obtain the coefficient for each of the groups included in the null hypothesis, employ
Equation 21.20. The latter equation, which is applied to each side of the null hypothesis,
represents the reciprocal of the number of means on a specified side of the null hypothesis.

$$\text{Coefficient} = \frac{1}{\text{Number of group means}}$$ (Equation 21.20)

Since there is only one mean to the left of the equals sign ($\mu_3$), employing Equation 21.20
we determine that the coefficient for that group (Group 3) equals 1/1 = 1. Since there are two
means to the right of the equals sign ($\mu_1$ and $\mu_2$), using Equation 21.20 we determine that the
coefficient for both of the groups to the right of the equals sign (Groups 1 and 2) equals 1/2.
Notice that all groups on the same side of the equals sign receive the same coefficient.

4) The coefficient(s) on one side of the equals sign are assigned a positive sign, and the
coefficient(s) on the other side of the equals sign are assigned a negative sign. Equivalent results
will be obtained irrespective of which side of the equals sign is assigned positive versus negative
coefficients. In the complex comparison under discussion, a positive sign is assigned to the
coefficient to the left of the equals sign, and negative signs are assigned to coefficients to the
right of the equals sign. Thus, the values of the coefficients are: $c_1 = -1/2$; $c_2 = -1/2$; $c_3 = +1$.
Note that the sum of the coefficients must always equal zero (i.e., $c_1 + c_2 + c_3 = (-1/2) + (-1/2)
+ 1 = 0$).
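
The four-step protocol above can be expressed as a short Python sketch. It is only an illustration of the rule in Equation 21.20 (reciprocal of the number of means on each side, opposite signs on the two sides, zero for excluded groups); the function name `contrast_coefficients` and the 1-based group numbering are assumptions made for this example.

```python
from fractions import Fraction


def contrast_coefficients(k, left_groups, right_groups):
    """Assign contrast coefficients to k groups (numbered 1..k).

    Groups on the left side of the null hypothesis get +1/(number of means
    on the left); groups on the right get -1/(number of means on the right);
    groups not named in the null hypothesis get 0.
    """
    if not left_groups or not right_groups:
        raise ValueError("each side of the null hypothesis needs at least one group")
    if set(left_groups) & set(right_groups):
        raise ValueError("a group cannot appear on both sides")
    coefficients = [Fraction(0)] * k
    for g in left_groups:
        coefficients[g - 1] = Fraction(1, len(left_groups))
    for g in right_groups:
        coefficients[g - 1] = -Fraction(1, len(right_groups))
    return coefficients


# Complex comparison: Group 3 versus Groups 1 and 2 (Table 21.4)
complex_c = contrast_coefficients(3, left_groups=[3], right_groups=[1, 2])
# Simple comparison: Group 1 versus Group 2 (Table 21.3)
simple_c = contrast_coefficients(3, left_groups=[1], right_groups=[2])

print(complex_c, sum(complex_c), sum(c ** 2 for c in complex_c))
print(simple_c, sum(simple_c), sum(c ** 2 for c in simple_c))
```

Exact fractions are used so that the zero-sum property and the sum of squared coefficients (3/2 for the complex comparison, 2 for the simple comparison) can be checked without rounding error.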

Equations 21.17 through 21.19, which are employed for the simple comparison, are also used to
evaluate a complex comparison. Substituting the appropriate information from Table 21.4 in
Equation 21.17, the value $SS_{comp} = 9.63$ is computed.

$$SS_{comp} = \frac{5(-1.7)^2}{1.5} = 9.63$$

Employing Equation 21.18, the value $MS_{comp} = 9.63$ is computed. Note that since the
complex comparison is a single degree of freedom comparison, $df_{comp} = 1$.

$$MS_{comp} = \frac{9.63}{1} = 9.63$$

Employing Equation 21.19, the value $F_{comp} = 5.07$ is computed.

$$F_{comp} = \frac{9.63}{1.9} = 5.07$$


The protocol for evaluating the value $F_{comp} = 5.07$ computed for the complex comparison
is identical to that employed for the simple comparison. In determining the tabled critical F value
in Table A10, the same degrees of freedom values are employed. This is the case since the
numerator degrees of freedom for any single degree of freedom comparison is $df_{comp} = 1$. The
denominator degrees of freedom is $df_{WG}$, which, as in the case of the simple comparison, is
the value of $df_{WG}$ employed for the omnibus F test. Thus, the appropriate degrees of freedom
for the complex comparison are $df_{num} = 1$ and $df_{den} = 12$. The tabled critical .05 and .01 F
values in Table A10 for the latter degrees of freedom are $F_{.05} = 4.75$ and $F_{.01} = 9.33$. Since
the obtained value $F_{comp} = 5.07$ is greater than $F_{.05} = 4.75$, the nondirectional alternative
hypothesis $H_1: \mu_3 \neq (\mu_1 + \mu_2)/2$ is supported at the .05 level. Since $F_{comp} = 5.07$ is less than
$F_{.01} = 9.33$, the latter alternative hypothesis is not supported at the .01 level. Thus, if the value
$\alpha = .05$ is employed, the researcher can conclude that Group 3 recalled significantly fewer
nonsense syllables than the average number recalled when the performances of Groups
1 and 2 are combined.

Orthogonal comparisons Most sources agree that intelligently planned studies involve a limited
number of meaningful comparisons which the researcher plans prior to collecting the data. As a
general rule, any comparisons that are conducted should address critical questions underlying
the general hypothesis under study. Some researchers believe that it is not even necessary to
obtain an omnibus F value if, in fact, the critical information one is concerned with is contained
within the framework of the planned comparisons. It is generally recommended that the
maximum number of planned comparisons one conducts should not exceed the value of $df_{BG}$
employed for the omnibus F test. If the number of planned comparisons is equal to or less than
$df_{BG}$, sources generally agree that a researcher is not obliged to adjust the value of $\alpha_{PC}$.
When, however, the number of planned comparisons exceeds $df_{BG}$, many sources recommend
that the value of $\alpha_{FW}$ be adjusted. Applying this protocol to Example 21.1, since $df_{BG} = 2$, one
can conduct two planned comparisons without being obliged to adjust the value of $\alpha_{PC}$.

The subject of orthogonal comparisons is relevant to the general question of how many 
(and specifically which) comparisons a researcher should conduct. Orthogonal comparisons are 
defined as comparisons that are independent of one another. In other words, such comparisons 
are not redundant, in that they do not overlap with respect to the information they provide. In 
point of fact, the two comparisons that have been conducted (i.e., the simple comparison of 
Group 1 versus Group 2, and the complex comparison of Group 3 versus the combined 
performance of Groups 1 and 2) are orthogonal comparisons. This can be demonstrated by 
employing Equation 21.21, which defines the relationship that will exist between two 
comparisons if they are orthogonal to one another. In Equation 21.21, $c_{j1}$ is the coefficient
assigned to Group j in Comparison 1 and $c_{j2}$ is the coefficient assigned to Group j in
Comparison 2. If, in fact, two comparisons are orthogonal, the sum of the products of the
coefficients of all k groups will equal zero.


$$\sum_{j=1}^{k} (c_{j1})(c_{j2}) = 0$$ (Equation 21.21)


Equation 21.21 is employed below with the two comparisons that have been conducted in 
this section. Notice that for each group, the first value in the parentheses is the coefficient for 
that group for the simple comparison (Group 1 versus Group 2), while the second value in 
parentheses is the coefficient for that group for the complex comparison (Group 3 versus Groups 
1 and 2). 


  Group 1            Group 2            Group 3

(+1)(-1/2)    +    (-1)(-1/2)    +    (0)(+1)    =    (-1/2) + (+1/2) + 0 = 0
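
The check in Equation 21.21 is a sum of pairwise products of coefficients, which can be sketched as follows. The helper name `is_orthogonal` is hypothetical; exact fractions are used so that the comparison with zero is not affected by floating-point rounding.

```python
from fractions import Fraction


def coefficient_product_sum(c1, c2):
    """Sum over groups of (c_j1)(c_j2), the left side of Equation 21.21."""
    return sum(a * b for a, b in zip(c1, c2))


def is_orthogonal(c1, c2):
    """Two contrasts are orthogonal when the product sum equals zero."""
    return coefficient_product_sum(c1, c2) == 0


half = Fraction(1, 2)
simple_1_vs_2 = [Fraction(1), Fraction(-1), Fraction(0)]   # Table 21.3
complex_3_vs_12 = [-half, -half, Fraction(1)]              # Table 21.4
simple_1_vs_3 = [Fraction(1), Fraction(0), Fraction(-1)]   # Group 1 versus Group 3

print(is_orthogonal(simple_1_vs_2, complex_3_vs_12))            # True
print(coefficient_product_sum(simple_1_vs_3, complex_3_vs_12))  # -3/2, so not orthogonal
```

The third contrast (Group 1 versus Group 3) is included only to show what a nonzero result looks like when a comparison is not orthogonal to the complex comparison.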


If there are k treatments, there will be (k - 1) (which corresponds to $df_{BG}$ employed for
the omnibus F test) orthogonal comparisons (also known as orthogonal contrasts) within each
complete orthogonal set. This is illustrated by the fact that two comparisons comprise the
orthogonal set demonstrated above: specifically, Group 1 versus Group 2, and Group 3 versus
Groups 1 and 2. Actually, when (as is the case in Example 21.1) there are k = 3 treatments, there
are three possible sets of orthogonal comparisons, each set being comprised of one simple
comparison and one complex comparison. In addition to the set noted above, the following two
additional orthogonal sets can be formed: a) Group 1 versus Group 3; Group 2 versus Groups
1 and 3; b) Group 2 versus Group 3; Group 1 versus Groups 2 and 3.

When k > 3 there will be more than 2 contrasts in a set of orthogonal contrasts. Within that 
full set of orthogonal contrasts, if the coefficients from any two of the contrasts are substituted 
in Equation 21.21, they will yield a value of zero. It should also be noted that the number of 
possible sets of contrasts will increase as the value of k increases. It is important to note, 
however, that when all possible sets of contrasts are considered, most of them will not be 
orthogonal. With respect to determining those contrasts that are orthogonal, Howell (1992, 1997)
describes a simple protocol that can be employed to derive most (although not all) orthogonal
contrasts in a body of data. The procedure described by Howell (1992, 1997) is summarized in
Figure 21.1.


Group 1        Group 2        Group 3
Group 1 and Group 2 versus Group 3 (Contrast 1)
Group 1 versus Group 2 (Contrast 2)


Figure 21.1 Tree Diagram for Determining Orthogonal Contrasts 


In employing Figure 21.1, initially two blocks of groups are formed employing all k groups. 
A block can be comprised of one or more of the groups. In Figure 21.1, the first block is com- 
prised of Groups 1 and 2, and the second block of Group 3. This will represent the first contrast, 
which corresponds to the complex comparison that is described in this section. Any blocks that 
remain which are comprised of two or more groups are broken down into smaller blocks. Thus, 
the block comprised of Group 1 and Group 2 is broken down into two blocks, each consisting
of one group. The contrast of these two groups (Group 1 versus Group 2) represents the second 
contrast in the orthogonal set. 

Figure 21.1 can also be employed to derive the other two possible orthogonal sets for 
Example 21.1. To illustrate, the initial two blocks derived can be a block consisting of Groups 
1 and 3 and a second block consisting of Group 2. This represents the complex comparison of 
Group 2 versus Groups 1 and 3. The remaining block consisting of Groups 1 and 3 can be 
broken down into two blocks consisting of Group 1 and Group 3. The comparison of these two 
groups, which represents a simple comparison, constitutes the second comparison in that 
orthogonal set. 

Note that once a group has been assigned to a block, and that block is compared to an 
adjacent block, from that point on any other comparisons involving that group will be with other 
groups that fall within its own block. Thus, in our example, if the first comparison is Groups 1 
and 2 versus Group 3, the researcher cannot use the comparison Group 1 versus Group 3 as the 
second comparison for that set, since the two groups are in different blocks. If the latter com- 
parison is conducted, the sum of the products of the coefficients of all k groups for the two 
comparisons will not equal zero, and thus the two comparisons will not constitute an orthogonal set. To illustrate this latter
fact, the coefficients of the three groups for the simple comparison depicted in Table 21.3 are
rearranged as follows: $c_1 = +1$, $c_2 = 0$, and $c_3 = -1$. The latter coefficients are employed
if the simple comparison Group 1 versus Group 3 is conducted. Equation 21.21 is now employed 
to demonstrate that the Group 1 versus Group 3 comparison is not orthogonal to the complex 
comparison Group 3 versus Groups 1 and 2 summarized in Table 21.4. 


  Group 1            Group 2            Group 3

(+1)(-1/2)    +    (0)(-1/2)    +    (-1)(+1)    =    (-1/2) + 0 + (-1) = -3/2


It should be pointed out that when more than one set of orthogonal comparisons is
conducted, since the different sets are not orthogonal to one another, many of the comparisons one
conducts will not be independent of one another. For this reason, a researcher who does not want
to conduct any nonindependent comparisons should only conduct those comparisons involving
the orthogonal set which provides the most meaningful information with regard to the hypothesis
under study. It should be noted, however, that there is no immutable rule that states that a
researcher can only conduct orthogonal comparisons. Many sources point out there are times
when the questions addressed by nonorthogonal comparisons can contribute to a
researcher's understanding of the general hypothesis under study.

Another characteristic of orthogonal comparisons is that the sum of squares for all
comparisons that comprise a set of orthogonal comparisons equals the value of $SS_{BG}$ for the omnibus
F test. This reflects the fact that the variability in a set of orthogonal contrasts will account for
all the between-groups variability in the full set of data. Employing the data for the simple
comparison summarized in Table 21.3 and the complex comparison summarized in Table 21.4,
it is confirmed below that the sum of squares for the latter two comparisons (which comprise an
orthogonal set) equals the value $SS_{BG} = 26.53$ obtained with Equation 21.3.

$$SS_{BG} = 26.53 = SS_{\text{simple comparison}} + SS_{\text{complex comparison}} = 16.9 + 9.63 = 26.53$$
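
The partition of $SS_{BG}$ into the two orthogonal contrasts can be verified numerically. The sketch below assumes equal group sizes and computes the between-groups sum of squares directly from the group means as n times the sum of squared deviations from the grand mean, which is algebraically equivalent to the usual equal-n formula but is not necessarily the form in which Equation 21.3 is written. All function names are hypothetical.

```python
def ss_between(means, n):
    """Between-groups sum of squares for equal group sizes."""
    grand_mean = sum(means) / len(means)
    return n * sum((m - grand_mean) ** 2 for m in means)


def ss_contrast(means, coefficients, n):
    """Sum of squares for a single degree of freedom contrast (Equation 21.17)."""
    contrast = sum(c * m for c, m in zip(coefficients, means))
    return n * contrast ** 2 / sum(c ** 2 for c in coefficients)


means = [9.2, 6.6, 6.2]  # Example 21.1 group means
n = 5                    # subjects per group

ss_bg = ss_between(means, n)
ss_simple = ss_contrast(means, [1, -1, 0], n)         # Group 1 versus Group 2
ss_complex = ss_contrast(means, [-0.5, -0.5, 1], n)   # Group 3 versus Groups 1 and 2

print(round(ss_bg, 2), round(ss_simple, 2), round(ss_complex, 2))
print(abs(ss_bg - (ss_simple + ss_complex)) < 1e-9)   # True for an orthogonal set
```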


Test 21a: Multiple t tests/Fisher's LSD test One option a researcher has available after
computing an omnibus F value is to run multiple t tests (specifically, the t test for two
independent samples), in order to determine whether there is a significant difference between any
of the pairs of means that can be contrasted within the framework of either simple or complex
comparisons. In point of fact, it can be algebraically demonstrated that in the case of both simple
and complex comparisons, the use of multiple t tests will yield a result that is equivalent to that
obtained with the protocol described for conducting linear contrasts. When multiple t tests are
discussed as a procedure for conducting comparisons, most sources state that: a) Multiple t tests
should only be employed for planned comparisons; b) Since multiple t tests are only employed
for planned comparisons, they can be conducted regardless of whether or not the omnibus F
value is significant; and c) In conducting multiple t tests for planned comparisons, the researcher
is not required to adjust the value of $\alpha_{PC}$, as long as a limited number of comparisons are
conducted (as noted earlier, most sources state the number of planned comparisons should not
exceed $df_{BG}$). All of the aforementioned stipulations noted for multiple t tests also apply to
linear contrasts (since as noted above, multiple t tests and linear contrasts are computationally
equivalent).

When, on the other hand, comparisons are unplanned and multiple t tests are employed to
compare pairs of means, the use of multiple t tests within the latter context is referred to as
Fisher's LSD test (the term LSD is an abbreviation for least significant difference). When
Fisher's LSD test is compared with other unplanned comparison procedures, it provides the
most powerful test with respect to identifying differences between pairs of means, since it does
not adjust the value of $\alpha_{PC}$. Of all the unplanned comparison procedures, Fisher's LSD test
requires the smallest difference between two means in order to conclude that a difference is
significant. However, since Fisher's LSD test does not reduce the value of $\alpha_{FW}$, it has the
highest likelihood of committing one or more Type I errors in a set/family of comparisons. In
the discussion to follow, since multiple t tests and Fisher's LSD method are computationally
equivalent (as well as equivalent to linear contrasts), the term multiple t tests/Fisher's LSD
test will refer to a computational procedure which can be employed for both planned and
unplanned comparisons that does not adjust the value of $\alpha_{FW}$.

Equation 21.22 can be employed to compute the test statistic (which employs the t
distribution) for multiple t tests/Fisher's LSD test. Whereas Equation 21.22 can only be employed
for simple comparisons, Equation 21.23 is a generic equation that can be employed for both
simple and complex comparisons. It will be assumed that the null and alternative hypotheses for
any comparisons being conducted are as follows: $H_0: \mu_a = \mu_b$ versus $H_1: \mu_a \neq \mu_b$.

$$t = \frac{\bar{X}_a - \bar{X}_b}{\sqrt{\dfrac{2MS_{WG}}{n}}}$$ (Equation 21.22)

Where: $\bar{X}_a$ and $\bar{X}_b$ represent the two means contrasted in the comparison

$$t = \frac{\bar{X}_a - \bar{X}_b}{\sqrt{\dfrac{(\sum c_j^2)(MS_{WG})}{n}}}$$ (Equation 21.23)

In the case of a simple comparison, the value $\sum c_j^2$ in Equation 21.23 will always equal 2,
thus resulting in Equation 21.22.

The degrees of freedom employed in evaluating the t value computed with Equations 21.22
and 21.23 is the value of $df_{WG}$ computed for the omnibus F test. Thus, in the case of Example
21.1, the value $df_{WG} = 12$ is employed. Note that in Equations 21.22/21.23, the value $MS_{WG}$
computed for the omnibus F test is employed in computing the standard error of the difference
in the denominator of the t test equation, as opposed to the value $(s_1^2/n_1) + (s_2^2/n_2)$, which is
employed in Equation 11.1 (the equation for the t test for two independent samples when
$n_1 = n_2$). This is the case, since $MS_{WG}$ is a pooled estimate of the population variance based on
the full data set (i.e., the k groups for which the omnibus F value is computed).

Equation 21.22 is employed below to conduct the simple comparison of Group 1 versus
Group 2.

$$t = \frac{9.2 - 6.6}{\sqrt{\dfrac{(2)(1.9)}{5}}} = 2.99$$

The obtained value t = 2.99 is evaluated with Table A2 (Table of Student's t Distribution)
in the Appendix. For df = 12, the tabled critical two-tailed .05 and .01 values are $t_{.05} = 2.18$
and $t_{.01} = 3.06$. Since t = 2.99 is greater than $t_{.05} = 2.18$, the nondirectional alternative
hypothesis $H_1: \mu_1 \neq \mu_2$ is supported at the .05 level. Since t = 2.99 is less than $t_{.01} = 3.06$,
the latter alternative hypothesis is not supported at the .01 level. Thus, if $\alpha = .05$, the
researcher can conclude that Group 1 recalled a significantly greater number of nonsense syllables
than Group 2. This result is consistent with that obtained in the previous section using the
protocol for linear contrasts.
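
The equivalence between the t statistic of Equation 21.23 and the linear contrast F ratio can be illustrated numerically, since for a single degree of freedom comparison the square of t equals F. The sketch below is an illustration only; the function name `contrast_t` is hypothetical, and the numerator is written as the coefficient-weighted sum of means, which equals the difference between the two contrasted means. Carried at full precision, the simple comparison gives t of about 2.98; the value 2.99 reported in the text presumably reflects rounding of the intermediate standard error.

```python
import math


def contrast_t(means, coefficients, n, ms_wg):
    """t statistic for a simple or complex comparison (Equation 21.23),
    assuming n subjects in every group."""
    difference = sum(c * m for c, m in zip(coefficients, means))
    standard_error = math.sqrt(sum(c ** 2 for c in coefficients) * ms_wg / n)
    return difference / standard_error


means = [9.2, 6.6, 6.2]  # Example 21.1 group means
n, ms_wg = 5, 1.9

t_simple = contrast_t(means, [1, -1, 0], n, ms_wg)        # Group 1 versus Group 2
t_complex = contrast_t(means, [-0.5, -0.5, 1], n, ms_wg)  # Group 3 versus Groups 1 and 2

# Squaring t recovers the F ratios from the linear contrast procedure.
print(round(t_simple, 2), round(t_simple ** 2, 2))
print(round(t_complex, 2), round(t_complex ** 2, 2))
```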

Equation 21.24 employs multiple t tests/Fisher's LSD test to compute the minimum
required difference in order for two means to differ significantly from one another at a
prespecified level of significance. The latter value is represented by the notation $CD_{LSD}$, with
CD being the abbreviation for critical difference. Whereas Equation 21.24 only applies to
simple comparisons, Equation 21.25 is a generic equation that can be employed for both simple
and complex comparisons.

$$CD_{LSD} = \sqrt{F_{(1,WG)}}\sqrt{\frac{2MS_{WG}}{n}}$$ (Equation 21.24)

$$CD_{LSD} = \sqrt{F_{(1,WG)}}\sqrt{\frac{(\sum c_j^2)(MS_{WG})}{n}}$$ (Equation 21.25)

Where: $F_{(1,WG)}$ is the tabled critical F value for $df_{num} = 1$ and $df_{den} = df_{WG}$ at the
prespecified level of significance


Employing the appropriate values from Example 21.1 in Equation 21.24, the value
$CD_{LSD} = 1.90$ is computed.

$$CD_{LSD} = \sqrt{4.75}\sqrt{\frac{(2)(1.9)}{5}} = 1.90$$


Thus, in order to differ significantly at the .05 level, the means of any two groups must
differ from one another by at least 1.90 units. Employing Table 21.5, which summarizes the
differences between pairs of means involving all three experimental groups, it can be seen that
the following simple comparisons are significant at the .05 level if multiple t tests/Fisher's LSD
test are employed: $\bar{X}_1 - \bar{X}_2 = 2.6$; $\bar{X}_1 - \bar{X}_3 = 3$. The difference $\bar{X}_2 - \bar{X}_3 = .4$ is not
significant, since it is less than $CD_{LSD} = 1.90$.

Within the framework of the discussion to follow, it will be demonstrated that the CD value 
computed with multiple t tests/Fisher's LSD test is the smallest CD value that can be computed 
with any of the comparison methods that can be employed for the analysis of variance. 
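
The critical difference approach lends itself to a short sketch: compute $CD_{LSD}$ once, then compare every pairwise difference against it. The code below assumes the tabled critical value $F_{.05} = 4.75$ is supplied by the user (it is taken from Table A10 in the text, not computed); the function name `critical_difference_lsd` is hypothetical.

```python
import math
from itertools import combinations


def critical_difference_lsd(f_critical, sum_squared_coefficients, ms_wg, n):
    """Minimum required difference (Equation 21.25). With a sum of squared
    coefficients equal to 2 this reduces to Equation 21.24."""
    return math.sqrt(f_critical) * math.sqrt(sum_squared_coefficients * ms_wg / n)


means = {1: 9.2, 2: 6.6, 3: 6.2}  # Example 21.1 group means keyed by group number
cd_lsd = critical_difference_lsd(f_critical=4.75, sum_squared_coefficients=2,
                                 ms_wg=1.9, n=5)

# Pairs of groups whose means differ by at least CD_LSD
significant_pairs = [
    (a, b) for a, b in combinations(means, 2)
    if abs(means[a] - means[b]) >= cd_lsd
]

print(round(cd_lsd, 2))     # approximately 1.9
print(significant_pairs)    # [(1, 2), (1, 3)]
```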


Table 21.5 Differences Between Pairs of Means in Example 21.1


$\bar{X}_1 - \bar{X}_2$ = 9.2 - 6.6 = 2.6
$\bar{X}_1 - \bar{X}_3$ = 9.2 - 6.2 = 3.0
$\bar{X}_2 - \bar{X}_3$ = 6.6 - 6.2 = 0.4


Multiple t tests/Fisher's LSD test will now be demonstrated for a complex comparison.
Equations 21.23 and 21.25 are employed to evaluate the complex comparison involving the mean
of Group 3 versus the composite mean of Groups 1 and 2. Employing Equation 21.23, the
absolute value t = 2.25 is computed. The value $CD_{LSD} = 1.65$ is computed with Equation
21.25. Note that in computing the values of t and $CD_{LSD}$, the value $\sum c_j^2 = 1.5$ is employed
in Equations 21.23 and 21.25, as opposed to $\sum c_j^2 = 2$, which is employed in Equations 21.22
and 21.24. This latter fact accounts for why the value $CD_{LSD} = 1.65$ computed for the complex
comparison is smaller than the value $CD_{LSD} = 1.90$ computed for the simple comparison.

$$t = \frac{6.2 - 7.9}{\sqrt{\dfrac{(1.5)(1.9)}{5}}} = -2.25$$

$$CD_{LSD} = \sqrt{4.75}\sqrt{\frac{(1.5)(1.9)}{5}} = 1.65$$


Since the obtained absolute value t = 2.25 is greater than $t_{.05} = 2.18$ (which is the tabled
critical two-tailed .05 t value for $df_{WG} = 12$), the nondirectional alternative hypothesis
$H_1: \mu_3 \neq (\mu_1 + \mu_2)/2$ is supported at the .05 level. Since t = 2.25 is less than the tabled
critical two-tailed value $t_{.01} = 3.06$, the latter alternative hypothesis is not supported at the .01
level. Thus, if $\alpha = .05$, the researcher can conclude that Group 3 recalled significantly fewer
nonsense syllables than the average number recalled when the performances of
Groups 1 and 2 are combined. This result is consistent with that obtained in the previous section
using the protocol for linear contrasts.

The fact that the obtained absolute value of the difference $|\bar{X}_3 - \bar{X}_{1/2}| = |6.2 - 7.9| = 1.7$
is larger than the computed value $CD_{LSD} = 1.65$ is consistent with the fact that the difference for
the complex comparison is significant at the .05 level. The computed value $CD_{LSD} = 1.65$
indicates that for any complex comparison involving the set of coefficients employed in Table
21.4, the minimum required difference in order for the two mean values stipulated in the null
hypothesis to differ significantly from one another at the .05 level is $CD_{LSD} = 1.65$.

Test 21b: The Bonferroni-Dunn test First formally described by Dunn (1961), the
Bonferroni-Dunn test is based on the Bonferroni inequality, which states that the probability
of the occurrence of a set of events can never be greater than the sum of the individual
probabilities for each event. Although the Bonferroni-Dunn test is identified in most sources
as a planned comparison procedure, it can also be employed for unplanned comparisons. In
actuality, the Bonferroni-Dunn test is computationally identical to multiple t tests/Fisher's
LSD test/linear contrasts, except for the fact that the equation for the test statistic employs an
adjustment in order to reduce the value of $\alpha_{FW}$. By virtue of reducing $\alpha_{FW}$, the power of the
Bonferroni-Dunn test will always be less than the power associated with multiple t tests/
Fisher's LSD test/linear contrasts (since the latter procedure does not adjust the value of $\alpha_{FW}$).
As a general rule, whenever the Bonferroni-Dunn test is employed to conduct all possible
pairwise comparisons in a set of data, it provides the least powerful test of an alternative hypothesis
of all the available comparison procedures.

The Bonferroni-Dunn test requires a researcher to initially stipulate the highest
familywise Type I error rate he is willing to tolerate. For purposes of illustration, let us assume that
in conducting comparisons in reference to Example 21.1 the researcher does not want the value
of $\alpha_{FW}$ to exceed .05 (i.e., he does not want more than a 5% chance of committing at least
one Type I error in a set of comparisons). Let us also assume he either plans beforehand or
decides after computing the omnibus F value, that he will compare each of the group means with
one another. This will result in the following three simple comparisons: Group 1 versus Group
2; Group 1 versus Group 3; Group 2 versus Group 3. To ensure that the familywise Type I error
rate does not exceed .05, $\alpha_{FW} = .05$ is divided by c = 3, which represents the number of
comparisons that comprise the set. The resulting value $\alpha_{PC} = \alpha_{FW}/c = .05/3 = .0167$ represents,
for each of the comparisons that are conducted, the likelihood of committing a Type I error.
Thus, even if a Type I error is made for all three of the comparisons, the overall familywise Type
I error rate will not exceed .05 (since (3)(.0167) = .05).
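
The Bonferroni adjustment described above amounts to a single division, which the following sketch makes explicit. The second function is included only for contrast: it computes the familywise rate that results when no adjustment is made, under the assumption that the c comparisons are independent and that the familywise rate takes the standard form $1 - (1 - \alpha_{PC})^c$ (Equation 21.13 is not reproduced in this section, so that form is an assumption here). Both function names are hypothetical.

```python
def bonferroni_alpha_pc(alpha_fw, c):
    """Per comparison alpha that keeps the familywise rate at or below alpha_fw."""
    return alpha_fw / c


def familywise_rate(alpha_pc, c):
    """Familywise Type I error rate for c comparisons each tested at alpha_pc,
    assuming the comparisons are independent of one another."""
    return 1 - (1 - alpha_pc) ** c


c = 3  # Group 1 vs 2, Group 1 vs 3, Group 2 vs 3
alpha_pc = bonferroni_alpha_pc(0.05, c)

print(round(alpha_pc, 4))                       # 0.0167
print(round(familywise_rate(0.05, c), 2))       # unadjusted: about 0.14
print(familywise_rate(alpha_pc, c) <= 0.05)     # adjusted: True
```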

In the case of a simple comparison, Equation 21.26 is employed to compute $CD_{BD}$,
which will represent the Bonferroni-Dunn test statistic. $CD_{BD}$ is the minimum required
difference in order for two means to differ significantly from one another, if the familywise Type
I error rate is set at a prespecified level.

$$CD_{BD} = t_{BD}\sqrt{\frac{2MS_{WG}}{n}}$$ (Equation 21.26)





The value $t_{BD}$ in Equation 21.26 represents the tabled critical t value at the level of
significance that corresponds to the value of $\alpha_{PC}$ (which in this case equals $t_{.0167}$) for $df_{WG}$ (which
for Example 21.1 is $df_{WG} = 12$). The value of $t_{BD}$ can be obtained from detailed tables of the
t distribution prepared by Dunn (1961), or can be computed with Equation 21.27 (which is
described in Keppel (1991)).

$$t_{BD} = z + \frac{z^3 + z}{4(df_{WG} - 2)}$$ (Equation 21.27)


The value of z in Equation 21.27 is derived from Table A1 (Table of the Normal
Distribution) in the Appendix. It is the z value above which $\alpha_{PC}/2$ of the cases in the normal
distribution falls. Since in our example $\alpha_{PC} = .0167$, we look up the z value above which
.0167/2 = .0083 of the distribution falls. From Table A1 we can determine that the z value
which corresponds to the proportion .0083 is z = 2.39. Substituting z = 2.39 in Equation
21.27, the value $t_{BD} = 2.79$ is computed.

$$t_{BD} = 2.39 + \frac{(2.39)^3 + 2.39}{4(12 - 2)} = 2.79$$

Substituting the value $t_{BD} = 2.79$ in Equation 21.26, the value $CD_{BD} = 2.43$ is
computed.

$$CD_{BD} = 2.79\sqrt{\frac{(2)(1.9)}{5}} = 2.43$$
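
Equations 21.26 and 21.27 can be chained together as in the sketch below. The z value is entered by hand from Table A1, as in the text, rather than computed; the function names `t_bd_approximation` and `critical_difference_bd` are hypothetical labels for this illustration.

```python
import math


def t_bd_approximation(z, df_wg):
    """Approximate critical t value from a normal deviate z (Equation 21.27)."""
    return z + (z ** 3 + z) / (4 * (df_wg - 2))


def critical_difference_bd(t_bd, ms_wg, n):
    """Minimum required difference for a simple comparison (Equation 21.26)."""
    return t_bd * math.sqrt(2 * ms_wg / n)


# Example 21.1 with c = 3 comparisons: alpha_PC / 2 = .0083, so z = 2.39 (Table A1)
t_bd = t_bd_approximation(z=2.39, df_wg=12)
cd_bd = critical_difference_bd(t_bd, ms_wg=1.9, n=5)

print(round(t_bd, 2), round(cd_bd, 2))  # approximately 2.79 and 2.43
```

Because $CD_{BD}$ depends on the number of comparisons only through z, evaluating a larger family of comparisons requires nothing more than looking up a new z value and calling the same two functions again.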


Thus, in order to be significant, the difference between any pair of means contrasted in a
simple comparison must be at least 2.43 units. Referring to Table 21.5, we can determine that
(as is the case with multiple t tests/Fisher's LSD test) the following comparisons are significant:
Group 1 versus Group 2; Group 1 versus Group 3 (since the difference between the means of the
aforementioned groups is larger than $CD_{BD} = 2.43$). Note that the value $CD_{BD} = 2.43$ is
larger than the value $CD_{LSD} = 1.90$, computed with Equation 21.24. The difference between
the two CD values reflects the fact that in the case of the Bonferroni-Dunn test, for the set of
c = 3 simple comparisons the value of $\alpha_{FW}$ is .05, whereas in the case of multiple t tests/Fisher's
LSD test, the value of $\alpha_{FW}$ (which is computed with Equation 21.13) is .14. By virtue of
adjusting the value of $\alpha_{PC}$, the Bonferroni-Dunn test will always result in a larger CD value
than the value computed with multiple t tests/Fisher's LSD test.

The Bonferroni-Dunn test can be used for both simple and complex comparisons. Earlier 
in this section, it was noted that both simple and complex comparisons involving two sets of 
means (where each set of means in a complex comparison consists of a single mean or a com- 
bination of means) represent single degree of freedom comparisons. Keppel (1991, p. 167) notes 
that the number of possible single degree of freedom comparisons (to be designated $c_{(1\ df)}$) in a
set of data can be computed with Equation 21.28.


$$c_{(1\ df)} = 1 + \frac{3^k - 1}{2} - 2^k$$ (Equation 21.28)


Employing Equation 21.28 with Example 21.1 (where k = 3), the value $c_{(1\ df)} = 6$ is
computed.

$$c_{(1\ df)} = 1 + \frac{3^3 - 1}{2} - 2^3 = 6$$
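As a check on Equation 21.28, the following Python sketch counts the single degree of freedom comparisons by brute force and compares the count with the formula. The function names are hypothetical, and the enumeration rests on the assumption that a comparison contrasts one nonempty set of groups with another nonempty, non-overlapping set, with the order of the two sets ignored.

```python
from itertools import product

def count_single_df_comparisons(k):
    """Number of single degree of freedom comparisons (Equation 21.28)."""
    return 1 + (3 ** k - 1) // 2 - 2 ** k

def enumerate_single_df_comparisons(k):
    """Collect every contrast of one nonempty set of groups with another."""
    comparisons = set()
    # Each group is placed in set A, in set B, or left out of the comparison.
    for assignment in product("AB-", repeat=k):
        set_a = frozenset(g + 1 for g, side in enumerate(assignment) if side == "A")
        set_b = frozenset(g + 1 for g, side in enumerate(assignment) if side == "B")
        if set_a and set_b:
            # A versus B is the same comparison as B versus A.
            comparisons.add(frozenset((set_a, set_b)))
    return comparisons

print(count_single_df_comparisons(3))           # 6
print(len(enumerate_single_df_comparisons(3)))  # 6
```

For k = 3 both approaches give six comparisons, which matches the value computed above.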




The $c_{(1\ df)} = 6$ single degree of freedom comparisons that are possible when
k = 3 follow: Group 1 versus Group 2; Group 1 versus Group 3; Group 2 versus Group 3; Group
1 versus Groups 2 and 3; Group 2 versus Groups 1 and 3; Group 3 versus Groups 1 and 2. Thus,
if the Bonferroni-Dunn test is employed to conduct all six possible single degree of freedom
comparisons with $\alpha_{FW} = .05$, the per comparison error rate will be $\alpha_{PC} = .05/6 = .0083$.
Since $\alpha_{PC}/2 = .0083/2 = .00415$, employing Table A1 we can determine that the z value above
which a proportion of .00415 of the cases falls is approximately 2.635. Substituting z = 2.635 in
Equation 21.27, the value $t_{B/D} = 3.16$ is computed.


$$t_{B/D} = 2.635 + \frac{(2.635)^3 + 2.635}{4(12 - 2)} = 3.16$$


Substituting $t_{B/D} = 3.16$ in Equation 21.26, the value $CD_{B/D} = 2.75$ is computed.

$$CD_{B/D} = 3.16\sqrt{\frac{2(1.9)}{5}} = 2.75$$


Since Equation 21.26 is only valid for a simple comparison, the value $CD_{B/D} = 2.75$ only
applies to the three simple comparisons that are possible within the full set of six single degree
of freedom comparisons. Thus, in order to be significant, the difference between any two means
contrasted in a simple comparison must be at least 2.75 units. Note that the latter value is larger
than $CD_{B/D} = 2.43$ computed for a set of c = 3 comparisons.

If one or more complex comparisons are conducted, the $t_{B/D}$ value a researcher employs
will depend on the total number of comparisons (both simple and complex) being conducted.
As is the case with multiple t tests/Fisher's LSD test, the computed value of $CD_{B/D}$ will also
be a function of the coefficients employed in a comparison. Equation 21.29 is a generic equation
that can be employed for both simple and complex comparisons to compute the value of $CD_{B/D}$.
In the case of a simple comparison the term $\sum c_j^2 = 2$, thus resulting in Equation 21.26.


$$CD_{B/D} = t_{B/D}\sqrt{\frac{(\sum c_j^2)(MS_{WG})}{n}}$$ (Equation 21.29)


Equation 21.29 will be employed for the complex comparison of Group 3 versus the combined
performance of Groups 1 and 2. On the assumption that the six possible single degree of
freedom comparisons are conducted, the value $t_{B/D} = 3.16$ will be employed in the equation.

$$CD_{B/D} = 3.16\sqrt{\frac{(1.5)(1.9)}{5}} = 2.39$$
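Equation 21.29 can be sketched as a single Python function that handles both simple and complex comparisons. The function name and the way the coefficients are written out as lists are assumptions made for illustration; the coefficient sets are chosen so that the sums of squares equal the values 2 and 1.5 used in the text.

```python
import math

def cd_bonferroni_dunn(t_crit, coefficients, ms_wg, n):
    """Generic minimum required difference (Equation 21.29)."""
    sum_c_squared = sum(c ** 2 for c in coefficients)
    return t_crit * math.sqrt(sum_c_squared * ms_wg / n)

T_BD = 3.16  # critical value when all six single-df comparisons are conducted

# Simple comparison: Group 1 versus Group 2 (sum of squared coefficients = 2).
cd_simple = cd_bonferroni_dunn(T_BD, [1, -1, 0], ms_wg=1.9, n=5)
# Complex comparison: Group 3 versus the average of Groups 1 and 2 (sum = 1.5).
cd_complex = cd_bonferroni_dunn(T_BD, [-0.5, -0.5, 1], ms_wg=1.9, n=5)

print(round(cd_simple, 2))   # 2.75
print(round(cd_complex, 2))  # 2.39
```

With coefficients of 1 and -1 the function reduces to Equation 21.26, which is why the simple comparison reproduces the value 2.75.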


Thus, in order to be significant, the absolute value of the difference $\bar{X}_3 - [(\bar{X}_1 + \bar{X}_2)/2]$
must be at least 2.39 units. Since the obtained absolute difference of 1.7 is less than the latter
value, the nondirectional alternative hypothesis $H_1: \mu_3 \neq (\mu_1 + \mu_2)/2$ is not supported.
Recollect that when multiple t tests/Fisher's LSD test are employed for the same comparison,
the latter alternative hypothesis is supported. The difference between the results of the two
comparison procedures illustrates that the Bonferroni-Dunn test is a more conservative/less
powerful procedure.
It should be noted that in conducting comparisons (especially if they are planned) a
researcher may elect not to conduct all possible comparisons between pairs of means. Obviously,
the fewer comparisons that are conducted, the higher the value for $\alpha_{PC}$ that can be employed for
the Bonferroni-Dunn test. Nevertheless, some researchers consider the Bonferroni-Dunn
adjustment to be too severe, regardless of how many comparisons are conducted. In view of this,
various sources (e.g., Howell (1992, 1997) and Keppel (1991)) describe a modified version of
the Bonferroni-Dunn test that can be employed if one believes that the procedure described in
this section sacrifices too much power.

It should also be pointed out that in conducting the Bonferroni-Dunn test, as well as
any other comparison procedures which result in a reduction of the value of $\alpha_{PC}$, it is not
necessary that each comparison be assigned the same $\alpha_{PC}$ value. As long as the sum of the
$\alpha_{PC}$ values adds up to the value stipulated for $\alpha_{FW}$, the $\alpha_{PC}$ values can be distributed in any
way the researcher deems prudent. Thus, if certain comparisons are considered more important
than others, the researcher may be willing to tolerate a higher $\alpha_{PC}$ rate for such comparisons.
As an example, assume a researcher conducts three comparisons and sets $\alpha_{FW} = .05$. If the
first of the comparisons is considered to be the one of most interest and the researcher wants to
maximize the power of that comparison, the $\alpha_{PC}$ rate for that comparison can be set equal to .04,
and the $\alpha_{PC}$ rate for each of the other two comparisons can be set at .005. Note that since the
sum of the three values is .05, the value $\alpha_{FW} = .05$ is maintained.
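The constraint described above, namely that the per comparison rates must sum to the familywise rate, can be expressed in a few lines of Python. The function name is hypothetical, and a floating-point tolerance is used because decimal fractions such as .05/3 are not represented exactly.

```python
import math

def check_alpha_allocation(alpha_pc_values, alpha_fw):
    """Return True if the per comparison rates sum to the familywise rate."""
    return math.isclose(sum(alpha_pc_values), alpha_fw)

equal_split = [0.05 / 3] * 3          # the usual Bonferroni-Dunn allocation
unequal_split = [0.04, 0.005, 0.005]  # the allocation described in the text

print(check_alpha_allocation(equal_split, 0.05))
print(check_alpha_allocation(unequal_split, 0.05))
```

Both allocations satisfy the constraint; they differ only in how much power is given to each individual comparison.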


Test 21c: Tukey's HSD test (The term HSD is an abbreviation for honestly significant
difference) Tukey's HSD test is generally recommended for unplanned comparisons when a
researcher wants to make all possible pairwise comparisons (i.e., simple comparisons) in a set
of data. The total number of pairwise comparisons (c) that can be conducted for a set of data
can be computed with the following equation: c = [k(k - 1)]/2. Thus, if k = 3, the total number
of possible pairwise comparisons is c = [3(3 - 1)]/2 = 3 (which in the case of Example 21.1
are Group 1 versus Group 2, Group 1 versus Group 3, and Group 2 versus Group 3).

Tukey's HSD test (Tukey, 1953) controls the familywise Type I error rate so that it will
not exceed the prespecified alpha value employed in the analysis. Many sources view it as a
good compromise among the available unplanned comparison procedures, in that it maintains an
acceptable level for $\alpha_{FW}$, without resulting in an excessive decrease in power. Tukey's HSD
test is one of a number of comparison procedures that are based on the Studentized range
statistic (which is represented by the notation q). Like the t distribution, the distribution for the
Studentized range statistic is also employed to compare pairs of means. When the total number
of groups/treatments involved in an experiment is greater than two, a tabled critical q value will
be higher than the corresponding tabled critical t value for multiple t tests/Fisher's LSD test
for the same comparison. For a given degrees of freedom value, the magnitude of a tabled
critical q value increases as the number of groups employed in an experiment increases. Table
A13 in the Appendix (which is discussed in more detail later in this section) is the Table of the
Studentized Range Statistic.

Equation 21.30 is employed to compute the q statistic, which represents the test statistic 
for Tukey's HSD test. 

$$q = \frac{\bar{X}_a - \bar{X}_b}{\sqrt{\dfrac{MS_{WG}}{n}}}$$ (Equation 21.30)

Equation 21.30 will be employed with Example 21.1 to conduct the simple comparison of
Group 1 versus Group 2. When the latter comparison is evaluated with Equation 21.30, the value
q = 4.22 is computed.

$$q = \frac{9.2 - 6.6}{\sqrt{\dfrac{1.9}{5}}} = 4.22$$

The obtained value q = 4.22 is evaluated with Table A13. The latter table contains the
two-tailed .05 and .01 critical q values that are employed to evaluate a nondirectional alternative
hypothesis at the .05 and .01 levels, or a directional alternative hypothesis at the .10 and .02
levels. (It should be noted that Kirk (1995, p. 144) states that the Tukey procedure and all other
unplanned comparison procedures should only be employed to evaluate a nondirectional
alternative hypothesis.) As is the case with previous comparisons, the analysis will be in
reference to the nondirectional alternative hypothesis $H_1: \mu_1 \neq \mu_2$, with $\alpha_{FW} = .05$. Employing
the section of Table A13 for the .05 critical values, we locate the q value that is in the cell which
is the intersection of the column for k = 3 means (which represents the total number of groups
upon which the omnibus F value is based) and the row for $df_{error} = 12$ (which represents the
value $df_{WG} = df_{error} = 12$ computed for the omnibus F test). The tabled critical $q_{.05}$ value for
k = 3 means and $df_{error} = 12$ is $q_{.05} = 3.77$. In order to reject the null hypothesis, the obtained
absolute value of q must be equal to or greater than the tabled critical value at the prespecified
level of significance. Since the obtained value q = 4.22 is greater than the tabled critical
two-tailed value $q_{.05} = 3.77$, the nondirectional alternative hypothesis $H_1: \mu_1 \neq \mu_2$ is supported at
the .05 level (where .05 represents the value of $\alpha_{FW}$). It is not supported at the .01 level, since
q = 4.22 is less than the tabled critical value $q_{.01} = 5.05$.
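The evaluation of the q statistic described above can be sketched in Python as follows. The function name is hypothetical, and the critical values 3.77 and 5.05 are simply the tabled values quoted in the text rather than values computed from the Studentized range distribution.

```python
import math

def q_statistic(mean_a, mean_b, ms_wg, n):
    """Studentized range statistic for two means (Equation 21.30)."""
    return (mean_a - mean_b) / math.sqrt(ms_wg / n)

Q_05, Q_01 = 3.77, 5.05  # tabled critical values for k = 3 means and df = 12

q = q_statistic(9.2, 6.6, ms_wg=1.9, n=5)

print(round(q, 2))     # 4.22
print(abs(q) >= Q_05)  # True: supported at the .05 level
print(abs(q) >= Q_01)  # False: not supported at the .01 level
```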

Equation 21.31, which is algebraically derived from Equation 21.30, can be employed to
compute the minimum required difference ($CD_{HSD}$) in order for two means to differ significantly
from one another at a prespecified level of significance.


$$CD_{HSD} = q_{(k,\ df_{WG})}\sqrt{\frac{MS_{WG}}{n}}$$ (Equation 21.31)

Where: $q_{(k,\ df_{WG})}$ is the tabled critical q value for k groups/means and $df_{WG}$ is the value
employed for the omnibus F test


Employing the appropriate values in Equation 21.31, the value $CD_{HSD} = 2.32$ is computed.

$$CD_{HSD} = 3.77\sqrt{\frac{1.9}{5}} = 2.32$$


Thus, in order to be significant, the difference $\bar{X}_1 - \bar{X}_2$ must be at least 2.32 units. Note
that the value $CD_{HSD} = 2.32$ is greater than $CD_{LSD} = 1.90$, but less than $CD_{B/D} = 2.43$. This
reflects the fact that in conducting all pairwise comparisons, Tukey's HSD test will provide a
more powerful test of an alternative hypothesis than the Bonferroni-Dunn test. It also indicates
that Tukey's HSD test is less powerful than multiple t tests/Fisher's LSD test (which does not
control the value of $\alpha_{FW}$).
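The three minimum required differences discussed so far can be placed side by side with a short Python sketch. The variable names are hypothetical, the critical values are the tabled values quoted in the text, and the LSD line assumes that Equation 21.24 (not reproduced in this section) multiplies the square root of the tabled F value of 4.75 by the same standard error term used in Equation 21.26.

```python
import math
from itertools import combinations

MEANS = {1: 9.2, 2: 6.6, 3: 6.2}  # group means for Example 21.1
MS_WG, N = 1.9, 5

critical_differences = {
    "LSD": math.sqrt(4.75) * math.sqrt(2 * MS_WG / N),  # assumed form of Equation 21.24
    "HSD": 3.77 * math.sqrt(MS_WG / N),                 # Equation 21.31
    "B/D": 2.79 * math.sqrt(2 * MS_WG / N),             # Equation 21.26
}

for name, cd in critical_differences.items():
    # A pair is significant if its absolute mean difference reaches the CD value.
    significant = [
        (a, b) for a, b in combinations(MEANS, 2)
        if abs(MEANS[a] - MEANS[b]) >= cd
    ]
    print(name, round(cd, 2), significant)
```

In this example all three procedures flag the same two pairs (Group 1 versus Group 2 and Group 1 versus Group 3); the procedures differ only in how large a difference they require.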

The use of Equations 21.30 and 21.31 for Tukey's HSD test is based on the following 
assumptions: a) The distribution of data in the underlying population from which each of the 
samples is derived is normal; b) The variances of the k underlying populations represented by 
the k samples are equal to one another (i.e., homogeneous); and c) The sample sizes of each of
the groups being compared are equal. Kirk (1982, 1995), among others, discusses a number of
modifications of Tukey's HSD test which are recommended when there is reason to believe that
all or some of the aforementioned assumptions are violated.


Test 21d: The Newman-Keuls test The Newman-Keuls test (Keuls (1952), Newman (1939))
is another procedure for pairwise unplanned comparisons that employs the Studentized range
statistic. Although the Newman-Keuls test is more powerful than Tukey's HSD test, unlike
the latter test it does not ensure that in a set of pairwise comparisons the familywise Type I error
rate will not exceed a prespecified alpha value. Equations 21.30 and 21.31 (which are employed
for Tukey's HSD test) are also employed for the Newman-Keuls test. When, however, the
latter equations are used for the Newman-Keuls test, the appropriate tabled critical q value will
be a function of how far apart the two means being compared are from one another.

To be more specific, the k means are arranged ordinally (i.e., from lowest to highest).
Thus, in the case of Example 21.1, the k = 3 means are arranged in the following order:

$$\bar{X}_3 = 6.2 \qquad \bar{X}_2 = 6.6 \qquad \bar{X}_1 = 9.2$$


For each pairwise comparison that can be conducted, the number of steps between the two
means involved is determined. Because the tabled critical q value is a function
of the number of steps or layers that separate two mean values, the Newman-Keuls test is often
referred to as a stepwise or layered test. The number of steps (which will be represented by the
notation s) between any two means is determined as follows: Starting with the lower of the two
mean values (which will be the mean to the left), count until the higher of the two mean values
is reached. Each mean value employed in counting from the lower to the higher value in the pair
represents one step. Thus, s = 2 steps are involved in the simple comparison of Group 1 versus
Group 2, since we start at the left with $\bar{X}_2 = 6.6$ (the lower of the two means involved in the
comparison) which is step 1, and move right to the adjacent value $\bar{X}_1 = 9.2$ (the higher of the
two means involved in the comparison), which represents step 2. If Group 1 and Group 3 are
compared, s = 3 steps separate the two means, since we start at the left with $\bar{X}_3 = 6.2$ (the
lower of the two means involved in the comparison), move to the right counting $\bar{X}_2 = 6.6$ as step
2, and then move to $\bar{X}_1 = 9.2$ (the higher of the two means involved in the comparison), which
is step 3.

The Newman-Keuls test protocol requires that the pairwise comparisons be conducted in 
a specific order. The first comparison conducted is between the two means which are separated 
by the largest number of steps (which will represent the largest absolute difference between any 
two means). If the latter comparison is significant, any comparisons involving the second largest 
number of steps are conducted. If all the comparisons in the latter subset of comparisons are 
significant, the subset of comparisons for the next largest number of steps is conducted, and so 
on. The basic rule upon which the protocol is based is that if at any point in the analysis a 
comparison fails to yield a significant result, no further comparisons are conducted on pairs of 
means that are separated by fewer steps than the number of steps involved in the nonsignificant 
comparison. Employing this protocol with Example 21.1, the first comparison that is conducted 
is $\bar{X}_1$ versus $\bar{X}_3$, which involves s = 3 steps. If that comparison is significant, the comparisons
$\bar{X}_1$ versus $\bar{X}_2$ and $\bar{X}_2$ versus $\bar{X}_3$, both of which involve s = 2 steps, are conducted.

In employing Table A13 to determine the tabled critical q value to employ with Equations
21.30 and 21.31, instead of employing the column for k (the total number of groups/means in the
set of data), the column that is used is the one that corresponds to the number of steps between
the two means involved in a comparison. Thus, if Group 1 and Group 3 are compared, the
column for k = 3 groups in Table A13 is employed in determining the value of q. Since the value
of $df_{error}$ remains $df_{WG} = 12$, the value $q_{.05} = 3.77$ is employed in Equation 21.31. Since the
latter value is identical to the q value employed for Tukey's HSD test, the Newman-Keuls value
computed for the minimum required difference for two means ($CD_{N-K}$) will be the same as the
value $CD_{HSD}$ computed for Tukey's HSD test. Thus, $CD_{N-K} = (3.77)\sqrt{1.9/5} = 2.32$. The
latter result indicates that if $\alpha = .05$, the minimum required difference between two means for
the comparison Group 1 versus Group 3 is $CD_{N-K} = 2.32$. Since the absolute difference
$|\bar{X}_1 - \bar{X}_3| = 3$ is greater than $CD_{N-K} = 2.32$, the comparison is significant.

Since the result of the s = 3 step comparison is significant, the Newman-Keuls test is
employed to compare Group 1 versus Group 2 and Group 2 versus Group 3, which represent
s = 2 step comparisons. The value $q_{.05} = 3.08$ is employed in Equation 21.31, since the latter
value is the tabled critical $q_{.05}$ value for k = 2 means and $df_{error} = 12$. Substituting $q_{.05} = 3.08$
in Equation 21.31 yields the value $CD_{N-K} = (3.08)\sqrt{1.9/5} = 1.90$. Since the absolute difference
$|\bar{X}_1 - \bar{X}_2| = 2.6$ is greater than $CD_{N-K} = 1.90$, there is a significant difference between the
means of Groups 1 and 2. Since the absolute difference $|\bar{X}_2 - \bar{X}_3| = .4$ is less than $CD_{N-K}$
= 1.90, the researcher cannot conclude there is a significant difference between the means of
Groups 2 and 3. Note that the value $CD_{N-K} = 1.90$ computed for a two-step analysis is smaller
than the value $CD_{N-K} = 2.32$ computed for the three-step analysis. The general rule is that the
fewer the number of steps, the smaller the computed $CD_{N-K}$ value.
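The stepwise protocol can be sketched in Python as shown below. This is a minimal illustration of the rule as it is stated in this section (once a comparison is nonsignificant, no comparisons separated by fewer steps are conducted), not a full implementation of the Newman-Keuls procedure. The function name and the dictionary of critical q values keyed by number of steps are assumptions; the q values themselves are the tabled values quoted above.

```python
import math

def newman_keuls(means, q_by_steps, ms_wg, n):
    """Apply the stepwise protocol; return {(low_group, high_group): significant}."""
    # Order the group labels from the lowest mean to the highest mean.
    ordered = sorted(means, key=means.get)
    results = {}
    stop = False
    # Work from the largest number of steps down to two steps.
    for steps in range(len(ordered), 1, -1):
        if stop:
            break
        cd = q_by_steps[steps] * math.sqrt(ms_wg / n)  # Equation 21.31
        for i in range(len(ordered) - steps + 1):
            low, high = ordered[i], ordered[i + steps - 1]
            significant = abs(means[high] - means[low]) >= cd
            results[(low, high)] = significant
            if not significant:
                # Pairs separated by fewer steps are not tested after a failure.
                stop = True
    return results

means = {1: 9.2, 2: 6.6, 3: 6.2}
q_05 = {3: 3.77, 2: 3.08}  # tabled critical values for df = 12

results = newman_keuls(means, q_05, ms_wg=1.9, n=5)
print(results)  # {(3, 1): True, (3, 2): False, (2, 1): True}
```

The output agrees with the analysis above: Group 1 differs significantly from Groups 2 and 3, whereas Groups 2 and 3 do not differ significantly from one another.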

The astute reader will observe that $CD_{N-K} = 1.90$ is identical to $CD_{LSD} = 1.90$ obtained
with Equation 21.24. The latter result illustrates that when two steps are involved in a
Newman-Keuls comparison, it will always yield a value that is identical to $CD_{LSD}$ (i.e., the value computed
with multiple t tests/Fisher's LSD test). In point of fact, when s = 2, for a given degrees of
freedom value, the relationship between the values of q and t is as follows: $q = t\sqrt{2}$ and
$t = q/\sqrt{2}$. If the value $q_{.05} = 3.08$ is employed in the equation $t = q/\sqrt{2}$, $t = 3.08/\sqrt{2}$
= 2.18. Note that $(t_{.05} = 2.18)^2 = (F_{.05} = 4.75)$, and that $F_{.05} = 4.75$ is the tabled critical
value employed in Equation 21.24 which yields the value $CD_{LSD} = 1.90$.
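The relationship between q and t for a two-step comparison can be verified numerically. The sketch below uses hypothetical variable names and the tabled value 3.08 quoted above. Note that squaring the rounded t value gives 4.75, whereas squaring the unrounded value gives a slightly smaller number because the tabled q value is itself rounded.

```python
import math

q_05 = 3.08                  # tabled critical q for two means and df = 12
t_05 = q_05 / math.sqrt(2)   # t = q / sqrt(2) when s = 2

print(round(t_05, 2))                 # 2.18
print(round(round(t_05, 2) ** 2, 2))  # 4.75

# The two forms of the minimum required difference are algebraically identical.
cd_from_q = q_05 * math.sqrt(1.9 / 5)
cd_from_t = t_05 * math.sqrt(2 * 1.9 / 5)
print(math.isclose(cd_from_q, cd_from_t))  # True
```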

The value of $\alpha_{FW}$ associated with the Newman-Keuls test will be higher than the value
of $\alpha_{FW}$ for Tukey's HSD test. This is a direct result of the fact that the tabled critical
Studentized range values employed for the Newman-Keuls test are smaller than those
employed for Tukey's HSD test (with the exception of the comparison contrasting the lowest
and highest means in the set of k means). Within any subset of comparisons which are an equal
number of steps apart from one another, the overall Type I error rate for the Newman-Keuls test
within that subset will not exceed the prespecified value of alpha. The latter, however, does not
ensure that $\alpha_{FW}$ for all the possible pairwise comparisons will not exceed the prespecified value
of alpha. Because of its higher $\alpha_{FW}$ rate, the Newman-Keuls test is generally not held in high
esteem as an unplanned comparison procedure. Excellent discussions of the Newman-Keuls
test can be found in Maxwell and Delaney (1990) and Howell (1992, 1997).


Test 21e: The Scheffé test The Scheffé test (Scheffé, 1953), which is employed for unplanned
comparisons, is commonly described as the most conservative of the unplanned comparison
procedures. The test maintains a fixed value for $\alpha_{FW}$, regardless of how many simple and
complex comparisons are conducted. By virtue of controlling for a large number of potential
comparisons, the error rate for any single comparison (i.e., $\alpha_{PC}$) will be lower than the error
rate associated with any of the other comparison procedures (assuming an alternative procedure
employs the same value for $\alpha_{FW}$). Since in conducting unplanned pairwise comparisons
Tukey's HSD test provides a more powerful test of an alternative hypothesis than the Scheffé
test, most sources note that it is not prudent to employ the Scheffé test if only simple comparisons
are being conducted. The Scheffé test is, however, recommended whenever a researcher wants
to maintain a specific $\alpha_{FW}$ level, regardless of how many simple and complex comparisons
are conducted. Sources note that the Scheffé test can accommodate unequal sample sizes and is
quite robust with respect to violations of the assumptions underlying the analysis of variance (i.e.,
homogeneity of variance and normality of the underlying population distributions). Because of
the low value the Scheffé test imposes on the value of $\alpha_{PC}$ and the consequent loss of power
associated with each comparison, some sources recommend that in using the test a researcher
employ a larger value for $\alpha_{FW}$ than would ordinarily be the case. Thus, one might employ
$\alpha_{FW} = .10$ for the Scheffé test, instead of $\alpha_{FW} = .05$ which might be employed for a less
conservative comparison procedure.

Equation 21.32 is employed to compute the minimum required difference for the Scheffé
test ($CD_S$) in order for two means to differ significantly from one another at a prespecified level
of significance. Whereas Equation 21.32 only applies to simple comparisons, Equation 21.33
is a generic equation that can be employed for both simple and complex comparisons.





$$CD_S = \sqrt{(k - 1)F_{(df_{BG},\ df_{WG})}}\sqrt{\frac{2MS_{WG}}{n}}$$ (Equation 21.32)

$$CD_S = \sqrt{(k - 1)F_{(df_{BG},\ df_{WG})}}\sqrt{\frac{(\sum c_j^2)(MS_{WG})}{n}}$$ (Equation 21.33)

In Equations 21.32 and 21.33, the value $F_{(df_{BG},\ df_{WG})}$ is the tabled critical value that is
employed for the omnibus F test for a value of alpha that corresponds to the value of $\alpha_{FW}$.


Thus, in Example 21.1, if $\alpha_{FW} = .05$, $F_{.05} = 3.89$ for df = 2, 12 is used in Equations
21.32/21.33. Employing Equation 21.32, the value $CD_S = 2.43$ is computed for the simple
comparison of Group 1 versus Group 2.

$$CD_S = \sqrt{(3 - 1)(3.89)}\sqrt{\frac{2(1.9)}{5}} = 2.43$$


Thus, in order to be significant, the difference $\bar{X}_1 - \bar{X}_2$ (as well as the difference
between the means of any other two groups) must be at least 2.43 units. Since the absolute
difference $|\bar{X}_1 - \bar{X}_2| = 2.6$ is greater than $CD_S = 2.43$, the nondirectional alternative
hypothesis $H_1: \mu_1 \neq \mu_2$ is supported. Note that for the same comparison, the CD value computed
for the Scheffé test is larger than the previously computed values $CD_{LSD} = CD_{N-K} = 1.90$ and
$CD_{HSD} = 2.32$, but is equivalent to $CD_{B/D} = 2.43$. The fact that $CD_S = CD_{B/D}$ for the
comparison under discussion illustrates that although for simple comparisons the Scheffé test is
commonly described as the most conservative of the unplanned comparison procedures, when
the Bonferroni-Dunn test is employed for simple comparisons it may be as conservative as or
more conservative than the Scheffé test. Maxwell and Delaney (1990) note that in instances where
a researcher conducts a small number of comparisons, the Bonferroni-Dunn test will provide
a more powerful test of an alternative hypothesis than the Scheffé test. However, as the number
of comparisons increases, at some point the Scheffé test will become more powerful than the
Bonferroni-Dunn test. In general (although there are some exceptions), the Bonferroni-Dunn
test will be more powerful than the Scheffé test when the number of comparisons conducted is
less than [k(k - 1)]/2.

The Scheffé test is most commonly recommended when at least one complex comparison
is conducted in a set of unplanned comparisons. Although the other comparison procedures
discussed in this section can be employed for unplanned complex comparisons, the Scheffé test
is viewed as a more desirable alternative by most sources. It will now be demonstrated how the
Scheffé test can be employed for the complex comparison $\bar{X}_3$ versus $(\bar{X}_1 + \bar{X}_2)/2$. Substituting
the value $\sum c_j^2 = 1.5$ (which is computed in Table 21.4) and the other relevant values in
Equation 21.33, the value $CD_S = 2.11$ is computed.

$$CD_S = \sqrt{(3 - 1)(3.89)}\sqrt{\frac{(1.5)(1.9)}{5}} = 2.11$$

Thus, in order to be significant, the difference $\bar{X}_3 - [(\bar{X}_1 + \bar{X}_2)/2]$ (as well as the
difference between any set of means in a complex comparison for which $\sum c_j^2 = 1.5$) must be
equal to or greater than 2.11 units. Since the absolute difference $|\bar{X}_3 - [(\bar{X}_1 + \bar{X}_2)/2]| = 1.7$ is
less than $CD_S = 2.11$, the nondirectional alternative hypothesis $H_1: \mu_3 \neq (\mu_1 + \mu_2)/2$ is not
supported. Note that the CD value computed for the Scheffé test is larger than the previously
computed value $CD_{LSD} = 1.65$, but less than $CD_{B/D} = 2.39$ computed for the same complex
comparison. This reflects the fact that for the complex comparison $\bar{X}_3$ versus $(\bar{X}_1 + \bar{X}_2)/2$,
the Scheffé test is not as powerful as multiple t tests/Fisher's LSD test, but is more powerful
than the Bonferroni-Dunn test.
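Equations 21.32 and 21.33 can be sketched as a single Python function, since the simple comparison is the special case in which the squared coefficients sum to 2. The function name and the coefficient lists are assumptions made for illustration; the numerical inputs are those of Example 21.1.

```python
import math

def cd_scheffe(f_crit, k, coefficients, ms_wg, n):
    """Minimum required difference for the Scheffé test (Equation 21.33)."""
    sum_c_squared = sum(c ** 2 for c in coefficients)
    return math.sqrt((k - 1) * f_crit) * math.sqrt(sum_c_squared * ms_wg / n)

MEANS = [9.2, 6.6, 6.2]  # means of Groups 1, 2, and 3
F_05 = 3.89              # tabled critical F for df = 2, 12

simple = [1, -1, 0]        # Group 1 versus Group 2
complex_ = [-0.5, -0.5, 1]  # Group 3 versus the average of Groups 1 and 2

for coefficients in (simple, complex_):
    cd = cd_scheffe(F_05, k=3, coefficients=coefficients, ms_wg=1.9, n=5)
    difference = abs(sum(c * m for c, m in zip(coefficients, MEANS)))
    print(round(cd, 2), round(difference, 2), difference >= cd)
```

The simple comparison (difference of 2.6 against a required 2.43) is significant, whereas the complex comparison (difference of 1.7 against a required 2.11) is not.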

Equation 21.34 can also be employed for the Scheffé test for both simple and complex 
comparisons. 


$$F_S = (k - 1)F_{(df_{BG},\ df_{WG})}$$ (Equation 21.34)

Where: $F_{(df_{BG},\ df_{WG})}$ is the tabled critical value at the prespecified level of significance employed
in the omnibus F test


In order to use Equation 21.34 it is necessary to first employ Equations 21.17-21.19 to
compute the value of $F_{comp}$ for the comparison being conducted. The value computed for
$F_{comp}$ will serve as the test statistic for the Scheffé test. Earlier in this section (under the
discussion of linear contrasts) the value $F_{comp} = 5.07$ is computed for the complex
comparison $\bar{X}_3$ versus $(\bar{X}_1 + \bar{X}_2)/2$. When $F_{comp}$ is used as the test statistic for the Scheffé test,
the critical value employed to evaluate it is different from the critical value employed in
evaluating a linear contrast. Equation 21.34 is used to determine the Scheffé test critical value.
In order for a comparison to be significant, the computed value of $F_{comp}$ must be equal to or
greater than the critical F value computed with Equation 21.34 (which is represented by the
notation $F_S$). In employing Equation 21.34 to compute $F_S$, the tabled critical value employed
for the omnibus F test (which in the case of Example 21.1 is $F_{.05} = 3.89$) is multiplied by
(k - 1). Obviously, the resulting value will be higher than the tabled critical value employed for the
linear contrast for the same comparison. When the appropriate values for Example 21.1 are
substituted in Equation 21.34, the value $F_S = 7.78$ is computed.

$$F_S = (3 - 1)(3.89) = 7.78$$


Since the computed value $F_{comp} = 5.07$ is less than the critical value $F_S = 7.78$
computed with Equation 21.34, it indicates that the nondirectional alternative hypothesis
$H_1: \mu_3 \neq (\mu_1 + \mu_2)/2$ is not supported. Note that the value $F_S = 7.78$ is larger than the value
$F_{.05} = 4.75$ (which is the tabled critical value for $df_{num} = 1$ and $df_{den} = df_{WG} = 12$) that is
employed for the linear contrast for the same comparison. Recollect that the alternative
hypothesis $H_1: \mu_3 \neq (\mu_1 + \mu_2)/2$ is supported when a linear contrast is conducted.
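The contrast between the two critical values can be illustrated with a short Python sketch. Equations 21.17-21.19 are not reproduced in this section, so the function `f_comparison` below is an assumption: it uses the standard equal-n formula in which the sum of squares for a single degree of freedom comparison is n times the squared contrast divided by the sum of the squared coefficients. All names are hypothetical.

```python
def f_comparison(means, coefficients, ms_wg, n):
    """F ratio for a single degree of freedom comparison with equal n (assumed form)."""
    contrast = sum(c * m for c, m in zip(coefficients, means))
    ss_comp = n * contrast ** 2 / sum(c ** 2 for c in coefficients)
    # The comparison has one degree of freedom, so MS_comp equals SS_comp.
    return ss_comp / ms_wg

def f_scheffe(f_crit, k):
    """Critical value for the Scheffé test (Equation 21.34)."""
    return (k - 1) * f_crit

f_comp = f_comparison([9.2, 6.6, 6.2], [-0.5, -0.5, 1], ms_wg=1.9, n=5)
f_s = f_scheffe(3.89, k=3)

print(round(f_comp, 2))  # 5.07
print(round(f_s, 2))     # 7.78
print(f_comp >= 4.75)    # True: significant as a linear contrast
print(f_comp >= f_s)     # False: not significant with the Scheffé test
```

Under this assumed formula the sketch reproduces $F_{comp} = 5.07$, and it shows the same test statistic leading to opposite decisions depending on which critical value is employed.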

In closing the discussion of the Scheffé test some general comments will be made regarding
the value of $\alpha_{FW}$ for the Scheffé test, the Bonferroni-Dunn test, and Tukey's HSD test. In
the discussion to follow it will be assumed that upon computation of an omnibus F value, a
researcher wishes to conduct a series of unplanned comparisons for which the familywise error
rate does not exceed $\alpha_{FW} = .05$.

a) If all possible comparisons (simple and complex) are conducted with the Scheffé test, the
value of $\alpha_{FW}$ will equal exactly .05. When k ≥ 3 there are actually an infinite number of
comparisons that can be made (Maxwell and Delaney (1990, p. 190)). To illustrate this, assume
k = 3. Besides the three pairwise/simple comparisons (Group 1 versus Group 2; Group 1 versus
Group 3; Group 2 versus Group 3) and the three apparent complex comparisons (The average
of Groups 1 and 2 versus Group 3; The average of Groups 1 and 3 versus Group 2; The average
of Groups 2 and 3 versus Group 1), in the case of a complex comparison it is possible to combine
two groups so that one group contributes more to the composite mean representing the two
groups than does the other group. As an example, in comparing Groups 1 and 2 with Group 3,
a coefficient of 7/8 can be assigned to Group 1 and a coefficient of 1/8 to Group 2. Employing
these coefficients, a composite mean value can be computed to represent the mean of the two
groups which is contrasted with Group 3. It should be obvious that if one can stipulate any
combination of two coefficients/weights that add up to 1, there are potentially an infinite number
of coefficient combinations that can be assigned to any two groups, and therefore an infinite
number of possible comparisons can result from coefficient combinations involving the
comparison of two groups with a third group. If fewer than all possible comparisons are
conducted with the Scheffé test, the value of $\alpha_{FW}$ will be less than .05, thus making it an overly
conservative test (since the value of $\alpha_{PC}$ will be lower than is necessary for $\alpha_{FW}$ to equal .05).

b) If all possible comparisons are conducted employing the Bonferroni-Dunn test, the
value of $\alpha_{FW}$ will be less than .05. As noted earlier, as the number of comparisons conducted
increases, at some point the value of $\alpha_{FW}$ for the Bonferroni-Dunn test will be less than $\alpha_{FW}$
for the Scheffé test, and thus at that point the Bonferroni-Dunn test will be even more
conservative (and thus less powerful) than the Scheffé test. The decrease in the value of $\alpha_{FW}$
for the Bonferroni-Dunn test results from the fact that within the set of comparisons
conducted, not all comparisons will be orthogonal with one another. Winer et al. (1991) note
that when comparisons conducted with the Bonferroni-Dunn test are orthogonal, the
following is true: $\alpha_{FW} = c(\alpha_{PC})$. However, when some of the comparisons conducted are not
orthogonal, $\alpha_{FW} < c(\alpha_{PC})$. Thus, by virtue of some of the comparisons being nonorthogonal,
the Bonferroni-Dunn test becomes a more conservative test (i.e., $\alpha_{FW} < .05$).

c) If Tukey's HSD test is employed for conducting all possible pairwise comparisons, the
value of $\alpha_{FW}$ will be exactly .05, even though the full set of pairwise comparisons will not
constitute an orthogonal set. As noted above, if the Bonferroni-Dunn test is employed for the
full set of pairwise comparisons, due to the presence of nonorthogonal comparisons, the value
of $\alpha_{FW}$ will be less than .05, and thus the value of $CD_{B/D}$ will be larger than the value of $CD_{HSD}$.
If in addition to conducting all pairwise comparisons with Tukey's HSD test, complex
comparisons are also conducted, the value of $\alpha_{FW}$ will exceed .05. As the number of complex
comparisons conducted increases, the value of $\alpha_{FW}$ increases. When complex comparisons are
conducted, Tukey's HSD test is not as powerful as the Scheffé test.


Test 21f: The Dunnett test The Dunnett test (1955, 1964) is a comparison procedure, only
employed for simple comparisons, that is designed to compare a control group with the other
(k - 1) groups in a set of data. Under such conditions the Dunnett test provides a more powerful
test of an alternative hypothesis than do the Bonferroni-Dunn test, Tukey's HSD test, and the
Scheffé test. This is the case, since for the same value of $\alpha_{FW}$, the $\alpha_{PC}$ value associated with
the Dunnett test will be higher than the $\alpha_{PC}$ values associated with the aforementioned
procedures (and, by virtue of this, provides a more powerful test of an alternative hypothesis). The
larger $\alpha_{PC}$ value for the Dunnett test is predicated on the fact that by virtue of limiting the
comparisons to contrasting a control group with the other groups, the Dunnett test statistic is
based on the assumption that fewer comparisons are conducted than will be the case if all pairwise
comparisons are conducted. Consequently, if a researcher specifies that $\alpha_{FW} = .05$, and the mean
of the control group is contrasted with the means of each of the other (k - 1) groups, the Dunnett
test ensures that the familywise Type I error rate will not exceed .05. It should be noted that
since the control group is involved in each of the comparisons that are conducted, the
comparisons will not be orthogonal to one another. In illustrating the computation of the Dunnett
test statistic, we will assume that in Example 21.1 Group 1 (the group that is not exposed to
noise) is a control group, and that Groups 2 and 3 (both of which are exposed to noise) are
experimental groups. Thus, employing the Dunnett test, the following two comparisons will be
conducted with $\alpha_{FW} = .05$: Group 1 versus Group 2; Group 1 versus Group 3.

The test statistic for the Dunnett test ($t_D$) is computed with Equation 21.35, which, except
for the fact that a $t_D$ value is computed, is identical to Equation 21.22 (which is employed to
compute the test statistic for multiple t tests/Fisher's LSD test).


X -X 
ip m (Equation 21.35) 
2MS yG 


n 


Equation 21.35 is employed below to compute the value $t_D = 2.99$ for the simple 
comparison of Group 1 versus Group 2. 

$$t_D = \frac{9.2 - 6.6}{\sqrt{\dfrac{(2)(1.9)}{5}}} = 2.99$$


The computed value $t_D = 2.99$ is evaluated with Table A14 (Table of Dunnett's Modified 
t Statistic for a Control Group Comparison) in the Appendix. The latter table, which 
contains both two-tailed and one-tailed .05 and .01 critical values, is based on a modified t 
distribution derived by Dunnett (1955, 1964). Dunnett (1955) computed one-tailed critical 
values, since in comparing one or more treatments with a control group a researcher is often 
interested in the direction of the difference. The tabled critical $t_D$ values are listed in reference 
to k, the total number of groups/treatments employed in the experiment, and the value of 
$df_{error} = df_{WG}$ computed for the omnibus F test. 

For $k = 3$ and $df_{error} = df_{WG} = 12$, the tabled critical two-tailed .05 and .01 values are 
$t_{D_{.05}} = 2.50$ and $t_{D_{.01}} = 3.39$, and the tabled critical one-tailed .05 and .01 values are 
$t_{D_{.05}} = 2.11$ and $t_{D_{.01}} = 3.01$. The computed value $t_D = 2.99$ is greater than the tabled critical 
two-tailed and one-tailed .05 $t_D$ values but less than the tabled critical two-tailed and one-tailed .01 $t_D$ 
values. Thus, for $\alpha_{FW} = .05$ (but not $\alpha_{FW} = .01$), the nondirectional alternative hypothesis 
$H_1$: $\mu_1 \neq \mu_2$ and the directional alternative hypothesis $H_1$: $\mu_1 > \mu_2$ are supported. The 
second comparison, involving the control group (Group 1) versus Group 3, yields the value 
$t_D = (9.2 - 6.2)/\sqrt{[(2)(1.9)]/5} = 3.45$. Since the value $t_D = 3.45$ is greater than the tabled 
critical two-tailed and one-tailed .05 and .01 $t_D$ values, the nondirectional alternative hypothesis 
$H_1$: $\mu_1 \neq \mu_3$ and the directional alternative hypothesis $H_1$: $\mu_1 > \mu_3$ are supported for both 
$\alpha_{FW} = .05$ and $\alpha_{FW} = .01$. 
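The two Dunnett comparisons can be checked with a short Python sketch of Equation 21.35. The function and variable names are illustrative assumptions, and the critical values are simply the Table A14 entries quoted in the text rather than values computed from Dunnett's distribution. Because the code does not round the denominator, it yields values a few hundredths below the 2.99 and 3.45 reported above, which does not change any decision.

```python
import math

def dunnett_t(mean_control, mean_treatment, ms_wg, n):
    """Equation 21.35: difference between two means over sqrt(2 * MS_WG / n)."""
    return (mean_control - mean_treatment) / math.sqrt(2 * ms_wg / n)

def is_significant(t_d, critical, two_tailed):
    """Two-tailed: compare |t_D|; one-tailed: compare the signed value
    (the predicted direction here is control mean > treatment mean)."""
    return (abs(t_d) if two_tailed else t_d) >= critical

# Summary values for Example 21.1 as quoted in the text.
MS_WG, N = 1.9, 5
MEANS = {1: 9.2, 2: 6.6, 3: 6.2}  # Group 1 is the control group

# Table A14 values for k = 3, df_WG = 12, copied from the text:
# (two_tailed, alpha_FW) -> tabled critical t_D
CRITICAL = {(True, .05): 2.50, (True, .01): 3.39,
            (False, .05): 2.11, (False, .01): 3.01}

results = {}
for group in (2, 3):
    t_d = dunnett_t(MEANS[1], MEANS[group], MS_WG, N)
    results[group] = {key: is_significant(t_d, crit, key[0])
                      for key, crit in CRITICAL.items()}
    print(f"Group 1 vs Group {group}: t_D = {t_d:.2f}", results[group])
```

Keeping the tabled values in a dictionary keyed by tail and significance level makes it easy to substitute the row for a different k or degrees of freedom.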

Equation 21.36 is employed to compute the minimum required difference for the Dunnett 
test (designated $CD_D$) in order for two means to differ significantly from one another at a 
prespecified level of significance. 

$$CD_D = t_{D(\alpha_{FW},\, df_{WG})} \sqrt{\frac{2MS_{WG}}{n}} \qquad \text{(Equation 21.36)}$$

Where: $t_{D(\alpha_{FW},\, df_{WG})}$ is the tabled critical value for Dunnett's modified t statistic for k groups and 
$df_{WG}$ at the prespecified value of $\alpha_{FW}$ 

In employing Equation 21.36, $\alpha_{FW} = .05$ will be employed, and it will be assumed that a 
nondirectional alternative hypothesis is evaluated for both comparisons. Substituting the 
two-tailed .05 value $t_{D_{.05}} = 2.50$ in Equation 21.36, the value $CD_D = 2.18$ is computed. 

$$CD_D = 2.50 \sqrt{\frac{(2)(1.9)}{5}} = 2.18$$





Thus, in order to be significant, the differences $\bar{X}_1 - \bar{X}_2$ and $\bar{X}_1 - \bar{X}_3$ must be at least 
2.18 units. Since the absolute differences $|\bar{X}_1 - \bar{X}_2| = 2.6$ and $|\bar{X}_1 - \bar{X}_3| = 3$ are greater than 
$CD_D = 2.18$, the nondirectional alternative hypothesis is supported for both comparisons. Note 
that $CD_D = 2.18$ computed for the Dunnett test is larger than $CD_{LSD} = 1.90$ computed for 
multiple t tests/Fisher's LSD test (for a simple comparison), but is less than the CD values 
computed for a simple comparison for the Bonferroni-Dunn test ($CD_{B/D} = 2.43$), Tukey's 
HSD test ($CD_{HSD} = 2.32$), and the Scheffé test ($CD_S = 2.43$). 
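As a rough check on the arithmetic, the following Python sketch evaluates Equation 21.36 and orders the critical differences quoted in the text. The helper name `critical_difference` is an assumption made for illustration, and all CD values other than the Dunnett one are copied from the text rather than recomputed.

```python
import math

def critical_difference(critical_value, ms_wg, n):
    """Equation 21.36: tabled critical value times sqrt(2 * MS_WG / n)."""
    return critical_value * math.sqrt(2 * ms_wg / n)

# Two-tailed .05 Dunnett value for k = 3, df_WG = 12 (from the text).
cd_d = critical_difference(2.50, ms_wg=1.9, n=5)
print(f"CD_D = {cd_d:.2f}")

# Absolute mean differences for the two control-group comparisons (from the text).
abs_diffs = {"Group 1 vs Group 2": 2.6, "Group 1 vs Group 3": 3.0}
significant = {name: diff >= cd_d for name, diff in abs_diffs.items()}
print(significant)

# CD values quoted in the text for a simple comparison, smallest to largest.
cd_values = {"LSD": 1.90, "Dunnett": round(cd_d, 2), "HSD": 2.32,
             "Bonferroni-Dunn": 2.43, "Scheffe": 2.43}
ranking = sorted(cd_values, key=cd_values.get)
print(ranking)
```

The same helper applies to any procedure whose critical difference has this form, provided the appropriate tabled value is supplied.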


Additional discussion of comparison procedures and final recommendations The accuracy 
of the comparison procedures described in this section may be compromised if the homogeneity 
of variance assumption underlying the analysis of variance (the evaluation of which is described 
later in Section VI) is violated. This is the case, since in such an instance $MS_{WG}$ may not provide 
the best measure of error variability for a given comparison. Violation of the homogeneity of 
variance assumption can either increase or decrease the Type I error rate associated with a 
comparison, depending upon whether $MS_{WG}$ overestimates or underestimates the pooled 
variability of the groups involved in a specific comparison. It is also the case that when the 
homogeneity of variance assumption is violated, the accuracy of a comparison may be even 
further compromised when there is not an equal number of subjects in each group. Sources that 
discuss these general issues (e.g., Howell (1992, 1997), Kirk (1982, 1995), Maxwell and Delaney 
(1990), Winer et al. (1991)) provide alternative equations which are recommended when the 
homogeneity of variance assumption is violated and/or sample sizes are unequal. 

As a general rule, the measure of within-groups variability that is employed in equations 
which are recommended when there is heterogeneity of variance is based on the pooled 
within-groups variability of just those groups which are involved in a specific comparison. Since the 
latter measure has fewer degrees of freedom associated with it than $MS_{WG}$, the tabled 
critical value for the analysis will be based on fewer degrees of freedom. Although the loss of 
degrees of freedom can reduce the power of the test, it may be offset if the revised measure of 
within-groups variability is less than $MS_{WG}$. A full discussion of the subject of violation of 
the homogeneity of variance assumption with comparisons is beyond the scope of this book. The 
reader who refers to sources that discuss the subject in greater detail will discover that there is 
a lack of agreement with respect to what procedure is most appropriate to employ when the 
assumption is violated. 

Numerous other multiple comparison procedures have been developed in addition to those 
described in this section. Howell (1992, 1997) and Kirk (1995), among others, describe a number 
of alternative procedures. Howell (1997) describes a procedure developed by Ryan (1960), which 
is a compromise between Tukey's HSD test and the Newman-Keuls test. To be more specific, 
Ryan's (1960) procedure maintains the value of $\alpha_{FW}$ at the desired level, but at the same time 
allows the critical difference required between pairs of means to vary as a function of step size. 
Kirk (1995, Ch. 4) describes the merits and limitations of 22 multiple comparison procedures. 
Within the framework of his discussion, Kirk (1995) notes that in conducting comparisons, a 
researcher's priority should be to guard against inflation of the Type I error rate, yet at the same 
time to employ a procedure that maximizes power (i.e., has a high likelihood of identifying a 
significant effect). 

At this point the author will present some general recommendations regarding the use of 
comparison procedures for the analysis of variance. From what has been said, it should be 
apparent that in conducting comparisons the minimum value required for two means to differ 
significantly from one another can vary dramatically, depending upon which comparison 
procedure a researcher employs. Although some recommendations have been made with respect to 
when it is viewed most prudent to employ each of the comparison procedures, in the final 
analysis the use of any of the procedures does not ensure that a researcher will determine the truth 
regarding the relationship between the variables under study. Aside from the fact that researchers 
do not agree among themselves on which comparison procedure to employ (due largely to the 
fact that they do not concur with respect to the maximum acceptable value for $\alpha_{FW}$), there is also 
the problem that one is not always able to assume with a high degree of confidence that all of the 
assumptions underlying a specific comparison procedure have, in fact, been met. In view of this, 
any probability value associated with a comparison may always be subject to challenge. 
Although most of the time a probability value may not be compromised to that great a degree, 
when one considers the fact that researchers may quibble over whether one should employ 
$\alpha_{FW} = .05$ versus $\alpha_{FW} = .10$, a minimal difference with respect to a probability value can mean 
a great deal, since it ultimately may determine whether a researcher elects to retain or reject a null 
hypothesis. If, in fact, the status of the null hypothesis is based on the result of a single study, 
it would seem that a researcher is obliged to arrive at a probability value in which he and others 
can have a high degree of confidence. 

In view of everything that has been discussed, this writer believes that the general strategy 
for conducting comparisons suggested by Keppel (1991) is the most prudent to employ. Keppel 
(1991) suggests that in hypothesis testing involving unplanned comparisons, instead of just 
employing the two decision categories of retaining the null hypothesis versus rejecting the 
null hypothesis, a third category, suspend judgement, be added. Specifically, Keppel (1991) 
recommends the following: 

a) If the obtained difference between two means is less than the value of $CD_{LSD}$, retain 
the null hypothesis. Since the value of $CD_{LSD}$ will be the smallest CD value computed with any 
of the available comparison procedures, it allows for the most powerful test of an alternative 
hypothesis. 

b) If the obtained difference between two means is equal to or greater than the CD value 
associated with the comparison procedure which results in the largest $\alpha_{FW}$ value one is willing 
to tolerate, reject the null hypothesis. Based on the procedures described in this book 
(depending upon the number of comparisons that are conducted), the largest CD value will be 
generated through use of either the Bonferroni-Dunn test or the Scheffé test. However, the 
largest $\alpha_{FW}$ value a researcher may be willing to tolerate may be larger than the $\alpha_{FW}$ value 
associated with either of the aforementioned procedures. In such a case, the minimum CD value 
required to reject the null hypothesis will be smaller than $CD_{B/D}$ or $CD_S$. 

c) If the obtained difference between two means is greater than or equal to $CD_{LSD}$, but less 
than the CD value associated with the largest $\alpha_{FW}$ value one is willing to tolerate, suspend 
judgement. It is recommended that one or more replication studies be conducted employing the 
relevant groups/treatments for any comparisons that fall in the suspend judgement category. 

The above guidelines will now be applied to Example 21.1. Let us assume that three simple 
comparisons are conducted, and that the maximum familywise Type I error rate the researcher 
is willing to tolerate is $\alpha_{FW} = .05$. Employing the above guidelines, the null hypothesis can be 
rejected for the comparisons of Group 1 versus Group 2 and Group 1 versus Group 3. This is 
the case, since $\bar{X}_1 - \bar{X}_2 = 2.6$ and $\bar{X}_1 - \bar{X}_3 = 3$, and in both instances the obtained difference 
between the means of the two groups is greater than $CD_{HSD} = 2.32$ (which ensures that $\alpha_{FW}$ 
will not exceed .05). On the other hand, for the comparison of Group 2 versus Group 3, since 
the difference $\bar{X}_2 - \bar{X}_3 = .4$ is less than $CD_{LSD} = 1.90$ (which is the lowest of the computed 
CD values), the null hypothesis cannot be rejected. Thus, in the case of Example 21.1, none of 
the comparisons falls in the suspend judgement category. In point of fact, even if the $\alpha_{FW}$ value 
associated with the Bonferroni-Dunn and/or Scheffé tests is employed as the maximum 
acceptable $\alpha_{FW}$ rate, none of the comparisons falls in the suspend judgement category, since 
the absolute difference between the means for each comparison is either greater than or less than 
the relevant CD values. 
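Keppel's three-category rule can be expressed as a small decision function. The sketch below is only an illustration: the function name and the string labels are assumptions, the CD values and mean differences are those quoted in the text, and the final call uses a hypothetical difference of 2.1 (not a value from Example 21.1) to show when the suspend judgement category arises.

```python
def keppel_decision(abs_diff, cd_smallest, cd_tolerated):
    """Three-way rule: retain below cd_smallest, reject at or above
    cd_tolerated, otherwise suspend judgement."""
    if abs_diff < cd_smallest:
        return "retain H0"
    if abs_diff >= cd_tolerated:
        return "reject H0"
    return "suspend judgement"

CD_LSD = 1.90  # smallest CD value (multiple t tests/Fisher's LSD test)
CD_HSD = 2.32  # CD value for the largest tolerated alpha_FW (.05, Tukey's HSD test)

# Absolute differences between means for the three simple comparisons.
abs_diffs = {"Group 1 vs Group 2": 2.6,
             "Group 1 vs Group 3": 3.0,
             "Group 2 vs Group 3": 0.4}

decisions = {name: keppel_decision(diff, CD_LSD, CD_HSD)
             for name, diff in abs_diffs.items()}
for name, decision in decisions.items():
    print(f"{name}: {decision}")

# Hypothetical difference falling between the two CD values.
hypothetical = keppel_decision(2.1, CD_LSD, CD_HSD)
print("Hypothetical difference of 2.1:", hypothetical)
```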

As noted earlier, in the case of planned comparisons the issue of whether or not a 
researcher should control the value of $\alpha_{FW}$ is subject to debate. It can be argued that the strategy 
described for unplanned comparisons should also be employed with planned comparisons, 
with the stipulation that the largest value for $\alpha_{FW}$ that one is willing to tolerate for planned 
comparisons be higher than the value employed for unplanned comparisons. Nevertheless, one can 
omit the latter stipulation and argue that the same criterion be employed for both unplanned and 
planned comparisons. The rationale for the latter position is as follows. Assume that two 
researchers independently conduct the identical study. Researcher 1 has the foresight to plan c 
comparisons beforehand. Researcher 2, on the other hand, conducts the same set of c 
comparisons, but does not plan them beforehand. The truth regarding the populations involved in the 
comparisons is totally independent of who conducts the study. As a result of this, one can argue 
that the same criterion be applied, regardless of who conducts the investigation. If Researcher 
1 is allowed to conduct a less conservative analysis (i.e., tolerate a higher $\alpha_{FW}$ rate) than 
Researcher 2, it is commensurate with giving Researcher 1 a bonus for having a bit more acumen 
than Researcher 2 (if we consider allowing one greater latitude with respect to rejecting the null 
hypothesis to constitute a bonus). It would seem that if, in the final analysis, the issue at hand 
is the truth concerning the populations under study, each of the two researchers should be 
expected to adhere to the same criterion, regardless of their expectations prior to conducting the 
study. If one accepts this line of reasoning, it would seem that the guidelines described in this 
section for unplanned comparisons should also be employed for planned comparisons. 

In the final analysis, regardless of whether one has conducted planned or unplanned com- 
parisons, when there is reasonable doubt in the mind of the researcher or there is reason to 
believe that there will be reasonable doubt among those who scrutinize the results of a study, it 
is always prudent to replicate a study. In other words, anytime the result of an analysis falls 
within the suspend judgement category (or perhaps even if the result is close to falling within 
it), a strong argument can be made for replicating a study. Thus, regardless of which comparison 
procedure one employs, if the result of a comparison is not significant, yet would have been had 
another comparison procedure been employed, it would seem logical to conduct at least one 
replication study in order to clarify the status of the null hypothesis. There is also the case where 
the result of a comparison turns out to be significant, yet would not have been with a more 
conservative comparison procedure. If in such an instance the researcher (or others who are 
familiar with the relevant literature) has reason to believe that a Type I error may have been 
committed, it would seem prudent to reevaluate the null hypothesis. In the final analysis, 
multiple replications of a result provide the most powerful evidence regarding the status of a null 
hypothesis. An effective tool that can be employed to pool the results of multiple studies which 
evaluate the same hypothesis is a methodology called meta-analysis, which is discussed in 
Section IX (the Addendum) of the Pearson product-moment correlation coefficient (Test 28). 

At this point in the discussion it is worth reiterating the difference between statistical and 
practical significance (which is discussed in both the Introduction and in Section VI of the t 
test for two independent samples). By virtue of employing a large sample size, virtually any 
test evaluating a difference between the means of two populations will turn out to be significant. 
This results from the fact that two population means are rarely identical. However, in most 
instances a minimal difference between two means is commensurate with there being no 
difference at all, since the magnitude of such a difference is of no practical or theoretical value. 
In conducting comparisons (or, for that matter, conducting any statistical test) one must decide 
what magnitude of difference is of practical and/or theoretical significance. To state it another 
way, one must determine the magnitude of the effect size one is attempting to identify. If a 
researcher is able to stipulate a meaningful effect size prior to collecting the data for a study, he 
can design the study so that the test which is employed to evaluate the null hypothesis is 
sufficiently powerful to identify the desired effect size. As a general rule, a researcher is best 
able to control the power of a statistical test by employing a sample size that exceeds some 
minimal value. In the latter part of Section VI the computation of power for the single-factor 
between-subjects analysis of variance is discussed in reference to both the omnibus F test as 
well as for comparison procedures. 

In the final analysis, when the magnitude of the obtained difference between the means 
involved in any comparison is deemed too small to be of practical or theoretical significance, it 
really becomes irrelevant whether a result is statistically significant. In such an instance if two 
or more comparison procedures yield conflicting results, replication of a study is not in order. 


The computation of a confidence interval for a comparison The computation of a 
confidence interval provides a researcher with a mechanism for determining a range of values 
within which he can be confident the true difference between the means of two populations falls. 
Computation of a confidence interval for a comparison is a straightforward procedure which can 
be easily implemented following the computation of a CD value. Specifically, to compute the 
range of values that define the 95% confidence interval for any comparison, one should do the 
following: Add to and subtract the computed value of CD from the obtained difference between 
the two means involved in the comparison. As an example, let us assume Tukey's HSD test is 
employed to compute the value $CD_{HSD} = 2.32$ for the comparison involving Group 1 versus 
Group 2. To compute the 95% confidence interval, the value 2.32 is added to and subtracted 
from 2.6, which is the difference between the two means. Thus, $CI_{HSD(.95)} = 2.6 \pm 2.32$, which 
can also be written as $.28 \leq (\mu_1 - \mu_2) \leq 4.92$. In other words, the researcher can be 95% 
confident (or the probability is .95) that the mean of the population represented by Group 1 is 
between .28 and 4.92 units larger than the mean of the population represented by Group 2. If the 
researcher wants to compute the 99% confidence interval for a comparison, the same procedure 
can be used, except for the fact that in computing the CD value for the comparison, the relevant 
.01 tabled critical q value is employed (or the tabled critical .01 $t$, $F$, or $t_D$ value if the 
comparison procedure happens to employ either of the aforementioned distributions). 
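The add-and-subtract rule for the confidence interval can be sketched in a few lines of Python. The function name `comparison_ci` is an illustrative assumption; the inputs are the mean difference and the Tukey HSD critical difference quoted in the text.

```python
def comparison_ci(mean_diff, cd):
    """Confidence interval for a comparison: mean difference plus/minus CD."""
    return mean_diff - cd, mean_diff + cd

# Group 1 versus Group 2 with the .05 critical difference for Tukey's HSD test.
lower, upper = comparison_ci(mean_diff=2.6, cd=2.32)
print(f"95% CI: {lower:.2f} to {upper:.2f}")

# The interval excludes zero exactly when |mean difference| exceeds CD.
excludes_zero = lower > 0 or upper < 0
print("Interval excludes zero:", excludes_zero)
```

Supplying a CD value based on a .01 tabled critical value in place of 2.32 gives the corresponding 99% interval.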

It should be emphasized that the range of values that define a confidence interval will be 
a function of the tabled critical value for the relevant test statistic that a researcher elects to 
employ in the analysis. As noted earlier, the tabled critical value one employs will be a function 
of the value established for $\alpha_{FW}$. In the case of the Bonferroni-Dunn test, the magnitude of the 
critical $t_{B/D}$ value one employs (and, consequently, the range of values that defines a confidence 
interval) increases as the number of comparisons one conducts increases. 

If one has reason to believe that the homogeneity of variance assumption underlying the 
analysis of variance is violated, one can argue that the measure of error variability employed in 
computing a confidence interval should be a pooled measure of variability based only on the 
groups involved in a specific comparison. Under such circumstances, one can also argue that it 
is acceptable to employ Equation 11.15 (employed to compute a confidence interval for the t test 
for two independent samples) to compute the confidence interval for a comparison. In using 
the latter equation for a simple comparison, one can argue that $df = n_1 + n_2 - 2$ (the degrees 
of freedom used for the t test for two independent samples) should be employed for the 
analysis as opposed to $df_{WG}$. The use of Equation 11.15 can also be justified in circumstances 
where: a) The researcher views the groups involved in a comparison as distinct (with respect to 
both the value of $\mu$ and $\sigma^2$) from the other groups involved in the study; and b) The researcher 
is not attempting to control the value of $\alpha_{FW}$. 
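To make the alternative concrete, the sketch below computes a confidence interval for Group 1 versus Group 2 using only the variability of those two groups. Several points are assumptions rather than statements from the text: Equation 11.15 is taken to have the usual form (mean difference plus or minus a critical t value times the pooled standard error), the Group 2 variance of 2.3 is inferred from $MS_{WG} = 1.9$ being the average of the three group variances (.7, 2.3, 2.7) when sample sizes are equal, and 2.306 is the standard two-tailed .05 critical t value for 8 degrees of freedom.

```python
import math

def pooled_two_group_ci(mean_1, mean_2, var_1, var_2, n_1, n_2, t_critical):
    """Confidence interval based on the pooled variance of just two groups,
    with df = n_1 + n_2 - 2 (assumed form of the two-sample t interval)."""
    df = n_1 + n_2 - 2
    pooled_var = ((n_1 - 1) * var_1 + (n_2 - 1) * var_2) / df
    std_error = math.sqrt(pooled_var * (1 / n_1 + 1 / n_2))
    diff = mean_1 - mean_2
    return diff - t_critical * std_error, diff + t_critical * std_error, df

T_05_DF8 = 2.306  # standard two-tailed .05 critical t value for df = 8

lower, upper, df = pooled_two_group_ci(
    mean_1=9.2, mean_2=6.6,   # group means from Example 21.1
    var_1=0.7, var_2=2.3,     # estimated population variances (2.3 is inferred)
    n_1=5, n_2=5, t_critical=T_05_DF8)
print(f"df = {df}, 95% CI: {lower:.2f} to {upper:.2f}")
```

Under these assumptions the interval is narrower than the one based on Tukey's HSD test, which illustrates how strongly the result depends on the chosen procedure and error term.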

In view of everything that has been said, it should be apparent that the value of a confidence 
interval can vary dramatically depending upon which of the comparison procedures a researcher 
employs, and what assumptions one is willing to make with reference to the underlying popu- 
lations under study. For this reason, two researchers may compute substantially different con- 
fidence intervals as a result of employing different comparison procedures. In the final analysis, 
however, each of the researchers may be able to offer a persuasive argument in favor of the 
methodology he employs. 


2. Comparing the means of three or more groups when $k \geq 4$ Within the framework of a 
single-factor between-subjects analysis of variance involving k = 4 or more groups, a 
researcher may wish to evaluate a general hypothesis with respect to the means of a subset of 
groups, where the number of groups in the subset is some value less than k. Although the latter 
type of situation is not commonly encountered in research, this section will describe the protocol 
for conducting such an analysis. 

To illustrate, assume that a fourth group is added to Example 21.1. Assume that the scores 
of the five subjects who serve in Group 4 are as follows: 3, 2, 1, 4, 5. Thus, $\Sigma X_4 = 15$, 
$\bar{X}_4 = 3$, and $\Sigma X_4^2 = 55$. If the data for Group 4 are integrated into the data for the other three 
groups whose performance is summarized in Table 21.1, the following summary values are 
computed: $N = nk = (5)(4) = 20$, $\Sigma X_T = 125$, $\Sigma X_T^2 = 911$. Substituting the revised values for 
$k = 4$ groups in Equations 21.2, 21.3, and 21.4/21.5, the following sum of squares values are 
computed: $SS_T = 129.75$, $SS_{BG} = 96.95$, $SS_{WG} = 32.8$. Employing the values $k = 4$ and 
$N = 20$ in Equations 21.8 and 21.9, the values $df_{BG} = 4 - 1 = 3$ and $df_{WG} = 20 - 4 = 16$ 
are computed. Substituting the appropriate values for the sum of squares and degrees of freedom 
in Equations 21.6 and 21.7, the values $MS_{BG} = 96.95/3 = 32.32$ and $MS_{WG} = 32.8/16 = 2.05$ 
are computed. Equation 21.12 is employed to compute the value $F = 32.32/2.05 = 15.77$. 
Table 21.6 is the summary table of the analysis of variance. 


Table 21.6 Summary Table of Analysis of Variance 
for Example 21.1 When k = 4 

Source of variation       SS       df       MS        F 
Between-groups           96.95      3     32.32     15.77 
Within-groups            32.80     16      2.05 
Total                   129.75     19 

Employing $df_{num} = 3$ and $df_{den} = 16$, the tabled critical .05 and .01 values are $F_{.05} = 3.24$ 
and $F_{.01} = 5.29$. Since the obtained value $F = 15.77$ is greater than both of the 
aforementioned critical values, the null hypothesis (which for $k = 4$ is $H_0$: $\mu_1 = \mu_2 = \mu_3 = \mu_4$) can 
be rejected at both the .05 and .01 levels. 
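The entries of the summary table can be reproduced from summary values alone. In the Python sketch below, the function name and dictionary keys are illustrative assumptions; the group sums for Groups 1 to 3 (46, 33, 31) are obtained as n times the group means quoted for Example 21.1, and the total sum of squared scores (911) is taken from the text. Because the code does not round $MS_{BG}$ before dividing, it yields F = 15.76 rather than the 15.77 obtained from the rounded value 32.32.

```python
def anova_from_sums(group_sums, sum_of_squared_scores, n):
    """Single-factor between-subjects ANOVA from group sums (equal n per group)."""
    k = len(group_sums)
    total_n = n * k
    correction = sum(group_sums) ** 2 / total_n          # (sum of X_T)^2 / N
    ss_t = sum_of_squared_scores - correction
    ss_bg = sum(s ** 2 for s in group_sums) / n - correction
    ss_wg = ss_t - ss_bg
    df_bg, df_wg = k - 1, total_n - k
    ms_bg, ms_wg = ss_bg / df_bg, ss_wg / df_wg
    return {"ss_t": ss_t, "ss_bg": ss_bg, "ss_wg": ss_wg,
            "df_bg": df_bg, "df_wg": df_wg,
            "ms_bg": ms_bg, "ms_wg": ms_wg, "f": ms_bg / ms_wg}

# Group sums: 5 * 9.2, 5 * 6.6, 5 * 6.2, and 15 for the added Group 4.
table = anova_from_sums(group_sums=[46, 33, 31, 15],
                        sum_of_squared_scores=911, n=5)
for name, value in table.items():
    print(f"{name}: {value:.2f}")
```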

Let us assume that prior to the above analysis the researcher has reason to believe that 
Groups 1, 2, and 3 may be distinct from Group 4. However, before he contrasts the composite 
mean of Groups 1, 2, and 3 with the mean of Group 4 (i.e., conducts the complex comparison 
which evaluates the null hypothesis $H_0$: $(\mu_1 + \mu_2 + \mu_3)/3 = \mu_4$), he decides to evaluate the 
null hypothesis $H_0$: $\mu_1 = \mu_2 = \mu_3$. If the latter null hypothesis is retained, he will assume 
that the three groups share a common mean value and, on the basis of this, he will compare 
their composite mean with the mean of Group 4. In order to evaluate the null hypothesis 
$H_0$: $\mu_1 = \mu_2 = \mu_3$, it is necessary for the researcher to conduct a separate analysis of variance 
that just involves the data for the three groups identified in the null hypothesis. The latter 
analysis of variance has already been conducted, since it is the original analysis of variance that 
is employed for Example 21.1, the results of which are summarized in Table 21.2. 

Upon conducting an analysis of variance on the data for all $k = 4$ groups as well as an 
analysis of variance on the data for the subset comprised of $k_{subset} = 3$ groups, the researcher has 
the necessary information to compute the appropriate F ratio (which will be represented with the 
notation $F_{(1/2/3)}$) for evaluating the null hypothesis $H_0$: $\mu_1 = \mu_2 = \mu_3$. In computing the F ratio 
to evaluate the latter null hypothesis, the following values are employed: a) $MS_{BG(1/2/3)} = 13.27$ 
(which is the value of $MS_{BG}$ computed for the analysis of variance in Table 21.2 that involves 
only the three groups identified in the null hypothesis $H_0$: $\mu_1 = \mu_2 = \mu_3$) is employed as the 
numerator of the F ratio; and b) $MS_{WG(1/2/3/4)} = 2.05$ (which is the value of $MS_{WG}$ computed in Table 
21.6 for the omnibus F test when the data for all $k = 4$ groups are evaluated) is employed 
as the denominator of the F ratio. The reason for employing the latter value instead of 
$MS_{WG(1/2/3)} = 1.9$ (which is the value of $MS_{WG}$ computed for the analysis of variance in Table 
21.2 that only employs the data for the three groups identified in the null hypothesis 
$H_0$: $\mu_1 = \mu_2 = \mu_3$) is because $MS_{WG(1/2/3/4)} = 2.05$ is a pooled estimate of all $k = 4$ population 
variances. If, in fact, the populations represented by all four groups have equal variances, this 
latter value will provide the most accurate estimate of $MS_{WG}$. Thus: 

$$F_{(1/2/3)} = \frac{MS_{BG(1/2/3)}}{MS_{WG(1/2/3/4)}} = \frac{13.27}{2.05} = 6.47$$


The degrees of freedom employed for the analysis are based on the mean square values 
employed in computing the $F_{(1/2/3)}$ ratio. Thus: $df_{num} = k_{subset} - 1 = 3 - 1 = 2$ (where 
$k_{subset} = 3$ groups) and $df_{den} = df_{WG(1/2/3/4)} = 16$ (which is $df_{WG}$ for the omnibus F test 
involving all $k = 4$ groups). For $df_{num} = 2$ and $df_{den} = 16$, $F_{.05} = 3.63$ and $F_{.01} = 6.23$. Since 
the obtained value $F = 6.47$ is greater than both of the aforementioned critical values, the null 
hypothesis can be rejected at both the .05 and .01 levels. Thus, the data do not support the 
researcher's hypothesis that Groups 1, 2, and 3 represent a homogeneous subset. In view of this, 
the researcher would not conduct the contrast $(\bar{X}_1 + \bar{X}_2 + \bar{X}_3)/3$ versus $\bar{X}_4$. 

It should be noted that if the researcher has reason to believe that Groups 1, 2, and 3 have 
homogeneous variances, and the variance of Group 4 is not homogeneous with the variance of the 
latter three groups, it would be more appropriate to employ $MS_{WG(1/2/3)} = 1.9$ as the denominator 
of the F ratio as opposed to $MS_{WG(1/2/3/4)} = 2.05$. With respect to the problem under discussion, 
since the two $MS_{WG}$ values are almost equivalent, using either value produces essentially the 
same result (i.e., if $MS_{WG(1/2/3)} = 1.9$ is employed to compute the $F_{(1/2/3)}$ ratio, 
$F_{(1/2/3)} = 13.27/1.9 = 6.98$, which is greater than both the tabled critical .05 and .01 F values). 
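The effect of the choice of error term on the subset F ratio can be seen in a brief Python sketch. The variable names are illustrative assumptions; the mean squares and the tabled critical values for 2 and 16 degrees of freedom are those quoted in the text.

```python
MS_BG_SUBSET = 13.27       # between-groups mean square for Groups 1, 2, 3
MS_WG_ALL_GROUPS = 2.05    # within-groups mean square pooled over all four groups
MS_WG_SUBSET = 1.9         # within-groups mean square for Groups 1, 2, 3 only

f_pooled_all = MS_BG_SUBSET / MS_WG_ALL_GROUPS
f_pooled_subset = MS_BG_SUBSET / MS_WG_SUBSET
print(f"F with MS_WG from all four groups: {f_pooled_all:.2f}")
print(f"F with MS_WG from the subset only: {f_pooled_subset:.2f}")

# Tabled critical F values for df_num = 2, df_den = 16 (from the text).
F_CRITICAL_2_16 = {.05: 3.63, .01: 6.23}
reject = {alpha: f_pooled_all >= crit for alpha, crit in F_CRITICAL_2_16.items()}
print("Reject H0 (mu_1 = mu_2 = mu_3):", reject)
```

Note that when the subset error term is used, the denominator degrees of freedom are those of the three-group analysis, so a different row of the F table applies.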


3. Evaluation of the homogeneity of variance assumption of the single-factor 
between-subjects analysis of variance In Section I it is noted that one assumption of the single-factor 
between-subjects analysis of variance is homogeneity of variance. As noted in the discussion 
of the t test for two independent samples, when there are $k = 2$ groups the homogeneity of 
variance assumption evaluates whether or not there is evidence to indicate an inequality exists 
between the variances of the populations represented by the two samples/groups. When there 
are two or more groups, the homogeneity of variance assumption evaluates whether there is 
evidence to indicate that an inequality exists between at least two of the population variances 
represented by the k samples/groups. When the latter condition exists it is referred to as 
heterogeneity of variance. In reference to Example 21.1, the null and alternative hypotheses 
employed in evaluating the homogeneity of variance assumption are as follows: 

Null hypothesis $H_0$: $\sigma_1^2 = \sigma_2^2 = \sigma_3^2$ 

(The variance of the population Group 1 represents equals the variance of the population Group 
2 represents equals the variance of the population Group 3 represents.) 

Alternative hypothesis $H_1$: Not $H_0$ 

(This indicates that there is a difference between at least two of the three population variances.) 


One of a number of procedures that can be employed to evaluate the homogeneity of 
variance hypothesis is Hartley's $F_{max}$ test (Test 11a), which is also employed to evaluate 
homogeneity of variance for the t test for two independent samples. The reader is advised 
to review the discussion of the $F_{max}$ test under the t test for two independent samples prior to 
continuing this section. 

Equation 21.37 (which is identical to Equation 11.6) is employed to compute the $F_{max}$ test 
statistic. 

$$F_{max} = \frac{\tilde{s}_L^2}{\tilde{s}_S^2} \qquad \text{(Equation 21.37)}$$

Where: $\tilde{s}_L^2$ = The largest of the estimated population variances of the k groups 
       $\tilde{s}_S^2$ = The smallest of the estimated population variances of the k groups 


Employing Equation I.5, the estimated population variances are computed for the three 
groups. 

$$\tilde{s}_1^2 = \frac{426 - \frac{(46)^2}{5}}{5 - 1} = .7 \qquad \tilde{s}_2^2 = \frac{227 - \frac{(33)^2}{5}}{5 - 1} = 2.3 \qquad \tilde{s}_3^2 = \frac{203 - \frac{(31)^2}{5}}{5 - 1} = 2.7$$

The largest and smallest estimated population variances that are employed in Equation 
21.37 are $\tilde{s}_L^2 = \tilde{s}_3^2 = 2.7$ and $\tilde{s}_S^2 = \tilde{s}_1^2 = .7$. Substituting the latter values in Equation 21.37, 
the value $F_{max} = 3.86$ is computed. 

$$F_{max} = \frac{2.7}{.7} = 3.86$$


The value of $F_{max}$ will always be a positive number that is greater than 1 (unless $\tilde{s}_L^2 = \tilde{s}_S^2$, 
in which case $F_{max} = 1$). The $F_{max}$ value obtained with Equation 21.37 is evaluated with Table 
A9 (Table of the $F_{max}$ Distribution) in the Appendix. Since in Example 21.1, there are $k = 3$ 
groups and $n = 5$ subjects per group, the tabled critical values in Table A9 that are employed are 
the values in the cell that is the intersection of the row $n - 1 = 4$ and the column $k = 3$. In order 
to reject the null hypothesis, and thus conclude that the homogeneity of variance assumption is 
violated, the obtained $F_{max}$ value must be equal to or greater than the tabled critical value at 
the prespecified level of significance. Inspection of Table A9 indicates that for $n - 1 = 4$ and 
$k = 3$, $F_{max_{.05}} = 15.5$ and $F_{max_{.01}} = 37$. Since the obtained value $F_{max} = 3.86$ is less than 
$F_{max_{.05}} = 15.5$, the null hypothesis cannot be rejected. In other words, the alternative hypothesis 
indicating the presence of heterogeneity of variance is not supported. 
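The computation of Hartley's statistic for Example 21.1 can be sketched as follows. The function name is an illustrative assumption. The variances .7 and 2.7 are stated in the text; the Group 2 variance of 2.3 is inferred from $MS_{WG} = 1.9$ being the average of the three group variances when sample sizes are equal. The critical values are the Table A9 entries quoted above, not values computed by the code.

```python
def f_max_statistic(variances):
    """Equation 21.37: largest estimated variance over smallest estimated variance."""
    return max(variances) / min(variances)

# Estimated population variances for Groups 1, 2, 3 (Group 2 value is inferred).
variances = [0.7, 2.3, 2.7]
f_max = f_max_statistic(variances)
print(f"F_max = {f_max:.2f}")

# Table A9 critical values for n - 1 = 4 and k = 3 (from the text).
F_MAX_CRITICAL = {.05: 15.5, .01: 37.0}
heterogeneity = {alpha: f_max >= crit for alpha, crit in F_MAX_CRITICAL.items()}
print("Heterogeneity of variance supported:", heterogeneity)
```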

Two assumptions of the $F_{max}$ test are: a) Each of the samples has been randomly selected 
from the population it represents; and b) The distribution of data in the underlying populations 
from which each of the samples is derived is normal. Various sources (e.g., Keppel (1991), 
Maxwell and Delaney (1990), and Winer et al. (1991)) note that when the normality assumption 
is violated, the accuracy of the $F_{max}$ test may be severely compromised. This problem becomes 
exacerbated when violation of the normality assumption occurs within the framework of an 
analysis involving small and/or unequal sample sizes. As noted in Section VI of the t test for 
two independent samples, the $F_{max}$ test assumes that there are an equal number of subjects per 
group. However, if the sample sizes of the groups being compared are unequal, but are 
approximately the same value, the value of the larger sample size can be employed to represent 
n in evaluating the test statistic. Kirk (1982, 1995) and Winer et al. (1991) note that using the 
larger n will result in a slight increase in the Type I error rate for the test. 

One criticism of the $F_{max}$ test is that it is less powerful than some alternative but 
computationally more involved procedures for evaluating the homogeneity of variance assumption. 
Some of these procedures can be found in sources on analysis of variance (e.g., Keppel (1991), 
Kirk (1982, 1995), Maxwell and Delaney (1990), and Winer et al. (1991)). Among the more 
commonly cited alternatives to the $F_{max}$ test are tests developed by Bartlett (1937) and Cochran 
(1941). Although both of these tests do use more information than the $F_{max}$ test, they are also 
subject to distortion when the underlying populations are not normally distributed. Winer et al. 
(1991) discuss tests developed by Box (1953) and Scheffé (1959) which are not as likely to be 
affected by violation of the normality assumption. Keppel (1991) and Winer et al. (1991) note 
that in a review of 56 tests of homogeneity of variance, Conover et al. (1981) recommend a test 
by Brown and Forsythe (1974a, 1974b). Howell (1992, 1997) and Maxwell and Delaney (1990), 
on the other hand, endorse the use of a test developed by O'Brien (1981). 

When all is said and done, perhaps the most reasonable approach for evaluating the
homogeneity of variance hypothesis is to employ a methodology suggested by Keppel (1991)
(also discussed in Keppel et al. (1992)). Keppel (1991) recommends the use of the $F_{\max}$ test in
evaluating the homogeneity of variance hypothesis, but notes that regardless of the values of n
or k, if $F_{\max} \ge 3$ a lower level of significance (i.e., a lower $\alpha$ value) should be employed in
evaluating the results of the analysis of variance in order to avoid inflating the Type I error rate
associated with the latter test. Keppel’s (1991) strategy is based on research which indicates that
when $F_{\max} \ge 3$, there is an increased likelihood that the accuracy of the tabled critical values
in the F distribution will be compromised. The fact that the value $F_{\max} = 3$ is considerably
lower than most of the tabled critical $F_{\max}$ values in Table A9 reinforces what was noted
earlier concerning the power of the $F_{\max}$ test: specifically, the low power of the test may
increase the likelihood of committing a Type II error (i.e., not rejecting the null hypothesis when
heterogeneity of variance is present).

When, in fact, the homogeneity of variance assumption is violated, there is an increased 
likelihood of committing a Type I error in conducting the analysis of variance evaluating the k 
group means. Factors that influence the degree to which the Type I error rate for the analysis of 
variance will be larger than the prespecified value of alpha are the size of the samples and the 
shapes of the underlying population distributions. Sources generally agree that the effect of 
violation of the homogeneity of variance assumption on the accuracy of the tabled critical F 
values is exacerbated when there are not an equal number of subjects in each group.

A variety of strategies have been suggested regarding how a researcher should deal with
heterogeneity of variance. Among the procedures that have been suggested are the following:
a) Keppel (1991) notes that one option available to the researcher is to employ an adjusted tabled
critical F value in evaluating the analysis of variance. Specifically, one can employ a tabled
critical value associated with a lower alpha level than the prespecified alpha level, so as to
provide a more accurate estimate of the latter value (i.e., employ $F_{.025}$ or $F_{.01}$ to estimate $F_{.05}$).
The problem with this strategy is that if too low an alpha value is employed, the power of the
omnibus F test may be compromised to an excessive degree. Loss of power, however, can be
offset by employing a large sample size; b) The data can be evaluated with a procedure other than
an analysis of variance. Thus, one can employ a rank-order nonparametric procedure such as the
Kruskal-Wallis one-way analysis of variance by ranks (Test 22). However, by virtue of
rank-ordering the data, the latter test will usually provide a less powerful test of an alternative
hypothesis than the analysis of variance. Another option is to employ the van der Waerden
normal scores test for k independent samples (Test 23), which is a nonparametric procedure
that under certain conditions can be as or more powerful than the analysis of variance. A number
of alternative parametric procedures developed by Brown and Forsythe (1974a, 1974c), James
(1951), and Welch (1951) are discussed in various sources. Keppel (1991) notes, however, that
the Brown and Forsythe and Welch procedures are not acceptable when k > 4, and that James’
procedure is too computationally involved for conventional use. In addition, some sources
believe that the aforementioned procedures have not been sufficiently researched to justify their
use as an alternative to the analysis of variance, even if the homogeneity of variance assumption
of the latter test is violated; and c) Another option available to the researcher is to equate the
estimated population variances by employing a data transformation procedure (discussed in
Section VII of the t test for two independent samples), and to conduct an analysis of variance
on the transformed data.


4. Computation of the power of the single-factor between-subjects analysis of variance
Prior to reading this section the reader may find it useful to review the discussion on power in
Section VI of both the single-sample t test (Test 2) and the t test for two independent samples.
Before conducting an analysis of variance a researcher may want to determine the minimum
sample size required in order to detect a specific effect size. To conduct such a power analysis,
the researcher will have to estimate the means of all of the populations that are represented by
the experimental treatments/groups. Additionally, he will have to estimate a standard deviation
value which it will be assumed represents the standard deviation of all k populations.
Understandably, the accuracy of one's power calculations will be a function of the researcher's
ability to come up with good approximations for the means and standard deviations of the
populations that are involved in the study. The basis for making such estimates will generally
be prior research concerning the hypothesis under study.

Equation 21.38 is employed for computing the power of the single-factor between-subjects
analysis of variance. The test statistic $\phi$ represents what is more formally known as
the noncentrality parameter, and is based on the noncentral F distribution.


\[ \phi = \sqrt{\frac{n \sum (\mu_j - \mu_T)^2}{k \, \sigma^2_{WG}}} \]  (Equation 21.38)

Where: $\mu_j$ = The estimated mean of the population represented by Group j
$\mu_T$ = The grand mean, which is the average of the k estimated population means
$\sigma^2_{WG}$ = The estimated population variance for each of the k groups
n = The number of subjects per group
k = The number of groups


The computation of the minimum acceptable sample size required to achieve a specified
level of power is generally determined prior to collecting the data for an experiment. Using trial
and error, a researcher can determine what value of n (based on the assumption that there are an
equal number of subjects per group) when substituted in Equation 21.38 will yield the desired
level of power. To illustrate the use of Equation 21.38 with Example 21.1, let us assume that
prior to conducting the study the researcher estimates that the means of the populations
represented by the three groups are as follows: $\mu_1 = 10$, $\mu_2 = 8$, $\mu_3 = 6$. Additionally, it
will be assumed that he estimates that the variance for each of the three populations the groups
represent is $\sigma^2_{WG} = 2.5$. Based on this information, the value $\mu_T = 8$ can be computed:
$\mu_T = (\mu_1 + \mu_2 + \mu_3)/k = (10 + 8 + 6)/3 = 8$. The appropriate values are now substituted
in Equation 21.38.

\[ \phi = \sqrt{\frac{n[(10 - 8)^2 + (8 - 8)^2 + (6 - 8)^2]}{(3)(2.5)}} = \sqrt{1.07n} = 1.03\sqrt{n} \]


At this point, Table A15 (Graphs of the Power Function for the Analysis of Variance)
in the Appendix can be employed to determine the necessary sample size required in order to
have the power stipulated by the experimenter. Table A15 comprises sets of power curves
that were derived by Pearson and Hartley (1951). Each set of curves is based on a different value
for $df_{num}$, which in the case of a single-factor between-subjects analysis of variance is $df_{BG}$
employed for the omnibus F test. Within each set of curves, for a given value of $df_{num}$ there are
power functions for both $\alpha = .05$ and $\alpha = .01$. For our analysis (for which it will be assumed that
$\alpha = .05$) the appropriate set of curves to employ is the set for $df_{num} = df_{BG} = 2$. Let us
assume we want the omnibus F test to have a power of at least .80. We now substitute what we
consider to be a reasonable value for n in the equation $\phi = 1.03\sqrt{n}$ (which is the result obtained
with Equation 21.38). To illustrate, the value n = 5 (the sample size employed for Example 21.1)
is substituted in the equation. The resulting value is $\phi = 1.03\sqrt{5} = 2.30$.

The value $\phi = 2.30$ is located on the abscissa (X-axis) of the relevant set of curves in Table
A15, specifically the set for $df_{num} = 2$. At the point corresponding to $\phi = 2.30$, a
perpendicular line is erected from the abscissa which intersects with the power curve that
corresponds to the value of $df_{den} = df_{WG}$ employed for the omnibus F test. Since $df_{WG} = 12$,
the curve for the latter value is employed. At the point the perpendicular intersects the curve
$df_{WG} = 12$, a second perpendicular line is drawn in relation to the ordinate (Y-axis). The point
at which this perpendicular intersects the ordinate indicates the power of the test. Since $\phi = 2.30$,
we determine the power equals .89. Thus, if we employ five subjects per group, there is a
probability of .89 of detecting an effect size equal to or larger than the one stipulated by the
researcher (which is a function of the estimated values for the population means relative to the
value estimated for the variance of a population). Since the probability of committing a Type II
error is $\beta = 1 - \text{power}$, $\beta = 1 - .89 = .11$. The latter value represents the likelihood of not
detecting an effect size equal to or greater than the one stipulated.
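
For readers who prefer to check the chart reading numerically, the following Python sketch computes $\phi$ with Equation 21.38 and then approximates power by simulation. The simulation is a substitute for the Table A15 lookup, not the procedure described above: it draws normally distributed scores with the estimated means and variance, computes the omnibus F for each simulated experiment, and counts how often F reaches the critical value. The function names, the number of replications, the seed, and the critical value 3.89 (the tabled $F_{.05}$ for df = 2, 12, which is not quoted in this passage) are assumptions made for illustration.

```python
import math
import random

def phi(means, variance, n):
    """Equation 21.38: sqrt(n * sum((mu_j - mu_T)^2) / (k * variance))."""
    k = len(means)
    grand_mean = sum(means) / k
    ss_means = sum((m - grand_mean) ** 2 for m in means)
    return math.sqrt(n * ss_means / (k * variance))

def f_ratio(groups):
    """Omnibus F = MS_BG / MS_WG for groups with equal n."""
    k, n = len(groups), len(groups[0])
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    ms_bg = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_wg = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (k * (n - 1))
    return ms_bg / ms_wg

def simulated_power(means, variance, n, f_critical, reps=4000, seed=1):
    """Proportion of simulated experiments in which F >= f_critical."""
    rng = random.Random(seed)
    sd = math.sqrt(variance)
    hits = 0
    for _ in range(reps):
        groups = [[rng.gauss(m, sd) for _ in range(n)] for m in means]
        if f_ratio(groups) >= f_critical:
            hits += 1
    return hits / reps

means, variance, n = [10, 8, 6], 2.5, 5
phi_value = phi(means, variance, n)  # about 2.31; the text obtains 2.30 by rounding 1.03 first
power = simulated_power(means, variance, n, f_critical=3.89)
print(round(phi_value, 2), round(power, 2))
```

The simulated proportion should land in the neighborhood of the value read from the power curves, although it will vary slightly with the seed and the number of replications.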

Cohen (1977, 1988), who provides a detailed discussion of power computations for the
analysis of variance, describes a measure of effect size for the analysis of variance based on
standard deviation units that is comparable to the d value computed for the different types of
t tests. In the case of the analysis of variance, d is the difference between the smallest and
largest of the estimated population means divided by the standard deviation of the populations.
In other words, $d = (\mu_L - \mu_S)/\sigma$. In our example the largest estimated population mean is
$\mu_L = 10$ and the smallest is $\mu_S = 6$. The value of $\sigma$ is $\sigma = \sqrt{\sigma^2_{WG}} = \sqrt{2.5} = 1.58$. Thus,
$d = (10 - 6)/1.58 = 2.53$. This result tells us that if n = 5, a researcher has a .89 probability
of detecting a difference of about two and one-half standard deviation units.

It is also possible to conduct a power analysis for comparisons that are conducted following
the computation of the omnibus F value. In fact, Keppel (1991) recommends that the sample size
employed in an experiment be based on the minimum acceptable power necessary to detect the
smallest effect size among all of the comparisons the researcher plans before collecting the data.
The value $\phi_{comp}$ (described by McFatter and Gollob (1986)), which is computed with Equation
21.39, is employed to determine the power of a comparison.

\[ \phi_{comp} = \sqrt{\frac{n(\mu_a - \mu_b)^2}{2\sigma^2_{WG}\left(\sum c_j^2\right)}} \]  (Equation 21.39)

Equation 21.39 can be used for both simple and complex single degree of freedom
comparisons. As a general rule, the equation is used for planned comparisons. Although it can be
extended to unplanned comparisons, published power tables for the analysis of variance generally
only apply to per comparison error rates of $\alpha = .05$ and $\alpha = .01$. In the case of planned and
especially unplanned comparisons which involve $\alpha_{PC}$ rates other than .05 or .01, more
detailed tables are required.

For single degree of freedom comparisons, the power curves in Table A15 for
$df_{num} = 1$ are always employed. The use of Equation 21.39 will be illustrated for the simple
comparison Group 1 versus Group 2 (summarized in Table 21.3). Since $\sum c_j^2 = 2$, and we have
estimated $\mu_a = \mu_1 = 10$, $\mu_b = \mu_2 = 8$, and $\sigma^2_{WG} = 2.5$, the following result is obtained.

\[ \phi_{comp} = \sqrt{\frac{n(10 - 8)^2}{(2)(2.5)(2)}} = \sqrt{.4n} = .63\sqrt{n} \]

Substituting n = 5 in the equation $\phi_{comp} = .63\sqrt{n}$, we obtain $\phi_{comp} = .63\sqrt{5} = 1.41$.
Employing the power curves for $df_{num} = 1$ with $\alpha = .05$, we use the curve for $df_{WG} = 12$ (the
$df_{WG}$ employed for the omnibus F test), and determine that when $\phi_{comp} = 1.41$, the power of
the test is approximately .44.

It should be noted that if the methodology described for computing the power of the t test
for two independent samples is employed for the above comparison or any other simple
comparison, it will produce the identical result. To demonstrate that Equation 21.39 produces a result
that is equivalent to that obtained with the protocol for computing the power of the t test for two
independent samples, Equations 11.10 and 11.12 will be employed to compute the power of the
comparison Group 1 versus Group 2. Employing Equation 11.10:

\[ d = \frac{|\mu_1 - \mu_2|}{\sigma} = \frac{|10 - 8|}{1.58} = 1.27 \]

Substituting the value of d in Equation 11.12, the value $\delta = 2.01$ is computed.

\[ \delta = d\sqrt{\frac{n}{2}} = 1.27\sqrt{\frac{5}{2}} = 2.01 \]

The value $\delta = 2.01$ is evaluated with Table A3 (Power Curves for Student's t
Distribution) in the Appendix. A full description of how to employ the power curves in Table
A3 can be found in Section VI of the single-sample t test. Employing Table A3-C (which is
the set of curves for a two-tailed analysis with $\alpha = .05$), we use the curve for $df_{WG} = 12$. It is
determined that when $\delta = 2.01$, the power of the comparison is approximately .44.

A final note regarding Equation 21.39: The value of $\sum c_j^2$ computed for a complex
comparison will always be lower than the value of $\sum c_j^2$ computed for a simple comparison (in the
case of a simple comparison, $\sum c_j^2$ will always equal 2). Because of this, for a fixed value of n
and a given difference $(\mu_a - \mu_b)$, the computed value of $\phi_{comp}$ will always be larger for a
complex comparison and, consequently, the
power of a complex comparison will always be greater than the power of a simple comparison.
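
The two points made above about Equation 21.39 (its agreement with the t test protocol, and the effect of $\sum c_j^2$) can be checked with a short Python sketch. The function names are hypothetical, and the complex comparison shown (Group 1 versus the average of Groups 2 and 3, with coefficients 1, -1/2, -1/2) is an assumed example that is not analyzed in this section. The mean difference is deliberately held at 2 for both comparisons so that only $\sum c_j^2$ changes.

```python
import math

def phi_comp(mean_difference, variance, n, coefficients):
    """Equation 21.39, with sum(c_j^2) computed from the comparison coefficients."""
    sum_c_squared = sum(c ** 2 for c in coefficients)
    return math.sqrt(n * mean_difference ** 2 / (2 * variance * sum_c_squared))

def delta(mean_difference, variance, n):
    """Equations 11.10 and 11.12: d = |difference| / sigma, delta = d * sqrt(n / 2)."""
    d = abs(mean_difference) / math.sqrt(variance)
    return d * math.sqrt(n / 2)

simple = [1, -1, 0]          # Group 1 versus Group 2: sum(c^2) = 2
complex_ = [1, -0.5, -0.5]   # assumed complex comparison: sum(c^2) = 1.5

phi_simple = phi_comp(10 - 8, 2.5, 5, simple)     # about 1.41
delta_simple = delta(10 - 8, 2.5, 5)              # about 2.00 (2.01 in the text, which rounds d to 1.27)
# Same n and same mean difference, smaller sum(c^2): phi_comp is larger.
phi_complex = phi_comp(10 - 8, 2.5, 5, complex_)  # about 1.63
print(round(phi_simple, 2), round(delta_simple, 2), round(phi_complex, 2))
```

For a simple comparison the two statistics differ only by a constant factor ($\delta = \phi_{comp}\sqrt{2}$), which is why the two sets of power curves yield the same power.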


5. Measures of magnitude of treatment effect for the single-factor between-subjects 
analysis of variance: Omega squared (Test 21g), eta squared (Test 21h), and Cohen’s f 
index (Test 21i) Prior to reading this section the reader should review the discussion of the 
measures of magnitude of treatment effect in Section VI of the t test for two independent
samples. As is the case with the t value computed for the latter test, the omnibus F value
computed for the single-factor between-subjects analysis of variance only provides a 
researcher with information regarding whether the null hypothesis can be rejected — i.e., 
whether a significant difference is present between at least two of the experimental treatments. 
The F value (as well as the level of significance with which it is associated), however, does not 
provide the researcher with any information regarding the size of any treatment effect that is 
present. As is the case with a t value, an F value is a function of both the difference between the 
means of the experimental treatments and the sample size. The measures described in this section 
are variously referred to as measures of effect size, measures of magnitude of treatment 
effect, measures of association, and correlation coefficients. 


Omega squared (Test 21g) A number of measures of the magnitude of treatment effect have
been developed which can be employed for the single-factor between-subjects analysis of
variance. Such measures, which are independent of sample size, provide an estimate of the
proportion of variability on the dependent variable that is associated with the independent variable/
experimental treatments. Although sources are not in total agreement with respect to which
measure of treatment effect is most appropriate for the single-factor between-subjects analysis
of variance, one commonly employed measure is the omega squared statistic ($\hat{\omega}^2$), which
provides an estimate of the underlying population parameter $\omega^2$. The value of $\hat{\omega}^2$ is computed
with Equation 21.41. The $\hat{\omega}^2$ value computed with Equation 21.41 is the best estimate of the
proportion of variability in the data that is attributed to the experimental treatments. It is obtained
by dividing treatment variability ($\sigma^2_{BG}$) by total variability (which equals $\sigma^2_{BG} + \sigma^2_{WG}$). Thus,
Equation 21.40 represents the population parameter estimated by Equation 21.41.

\[ \omega^2 = \frac{\sigma^2_{BG}}{\sigma^2_{BG} + \sigma^2_{WG}} \]  (Equation 21.40)

\[ \hat{\omega}^2 = \frac{SS_{BG} - (k - 1)MS_{WG}}{SS_T + MS_{WG}} \]  (Equation 21.41)


Although the value of $\hat{\omega}^2$ will generally fall in the range between 0 and 1, when F < 1
$\hat{\omega}^2$ will be a negative number. The closer $\hat{\omega}^2$ is to 1, the stronger the association between the
independent and dependent variables, whereas the closer $\hat{\omega}^2$ is to 0, the weaker the association
between the two variables. A $\hat{\omega}^2$ value equal to or less than 0 indicates that there is no
association between the variables. Keppel (1991) notes that in behavioral science research
(which is commonly characterized by a large amount of error variability) the value of $\hat{\omega}^2$ will
rarely be close to 1.

Employing Equation 21.41 with the data for Example 21.1, the value $\hat{\omega}^2 = .44$ is
computed.

\[ \hat{\omega}^2 = \frac{26.53 - (3 - 1)(1.9)}{49.33 + 1.9} = .44 \]

Equation 21.42 is an alternative equation for computing the value of $\hat{\omega}^2$ that yields the
same value as Equation 21.41.

\[ \hat{\omega}^2 = \frac{(k - 1)(F - 1)}{(k - 1)(F - 1) + nk} \]  (Equation 21.42)

\[ \hat{\omega}^2 = \frac{(2)(6.98 - 1)}{(2)(6.98 - 1) + (5)(3)} = .44 \]
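
The agreement between Equations 21.41 and 21.42 can be verified with the following Python sketch. The function names are hypothetical; the inputs are the Example 21.1 summary values used above ($SS_{BG} = 26.53$, $SS_T = 49.33$, $MS_{WG} = 1.9$, F = 6.98, k = 3, n = 5).

```python
def omega_squared(ss_bg, ss_total, ms_wg, k):
    """Equation 21.41: (SS_BG - (k - 1) * MS_WG) / (SS_T + MS_WG)."""
    return (ss_bg - (k - 1) * ms_wg) / (ss_total + ms_wg)

def omega_squared_from_f(f_value, k, n):
    """Equation 21.42 (equal n per group)."""
    numerator = (k - 1) * (f_value - 1)
    return numerator / (numerator + n * k)

w_from_sums = omega_squared(ss_bg=26.53, ss_total=49.33, ms_wg=1.9, k=3)
w_from_f = omega_squared_from_f(f_value=6.98, k=3, n=5)
# An F value below 1 yields a negative estimate, as noted in the text.
w_negative = omega_squared_from_f(f_value=0.5, k=3, n=5)
print(round(w_from_sums, 2), round(w_from_f, 2), w_negative < 0)
```

The two results differ only in later decimal places, because F = 6.98 is itself a rounded value.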





The value $\hat{\omega}^2 = .44$ indicates that 44% (or a proportion of .44) of the variability on the
dependent variable (the number of nonsense syllables correctly recalled) is associated with
variability on the levels of the independent variable (noise). To say it another way, 44% of the
variability on the recall scores of subjects can be accounted for on the basis of which group a
subject is a member. As noted in the discussion of omega squared in Section VI of the t test
for two independent samples, Cohen (1977; 1988, pp. 284-287) has suggested the following
(admittedly arbitrary) values, which are employed in psychology and a number of other
disciplines, as guidelines for interpreting $\hat{\omega}^2$: a) A small effect size is one that is greater than
.0099 but not more than .0588; b) A medium effect size is one that is greater than .0588 but not
more than .1379; c) A large effect size is greater than .1379. If one employs Cohen's (1977,
1988) guidelines for magnitude of treatment effect, $\hat{\omega}^2 = .44$ represents a large treatment effect.


Eta squared (Test 21h) Another measure of treatment effect employed for the single-factor
between-subjects analysis of variance is the eta squared statistic ($\hat{\eta}^2$) (which estimates the
underlying population parameter $\eta^2$). $\hat{\eta}^2$ is computed with Equation 21.43.

\[ \hat{\eta}^2 = \frac{SS_{BG}}{SS_T} \]  (Equation 21.43)

Employing Equation 21.43 with the data for Example 21.1, the value $\hat{\eta}^2 = .54$ is
computed.

\[ \hat{\eta}^2 = \frac{26.53}{49.33} = .54 \]


Note that the value $\hat{\eta}^2 = .54$ is larger than $\hat{\omega}^2 = .44$ computed with Equations
21.41/21.42. Sources note that $\hat{\eta}^2$ is a more biased estimate of the magnitude of treatment
effect in the underlying population than is $\hat{\omega}^2$, since $\hat{\eta}^2$ employs the values $SS_{BG}$ and $SS_T$,
which by themselves are biased estimates of population variability. Darlington and Carlson
(1987) note that Equation 21.44 can be employed to compute a less biased estimate of the
population parameter that is estimated by $\hat{\eta}^2$.

\[ \text{Adjusted } \hat{\eta}^2 = 1 - \frac{MS_{WG}}{MS_T} \]  (Equation 21.44)

Where: $MS_T = SS_T/(N - 1)$

Since $MS_T = 49.33/14 = 3.52$, the adjusted $\hat{\eta}^2 = .46$ falls in between $\hat{\omega}^2 = .44$ and
$\hat{\eta}^2 = .54$ computed with Equations 21.41/21.42 and Equation 21.43.

\[ \text{Adjusted } \hat{\eta}^2 = 1 - \frac{1.9}{3.52} = .46 \]
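
As a check on the arithmetic, Equations 21.43 and 21.44 can be written as two small Python functions. The function names are hypothetical; the inputs are the Example 21.1 values quoted above, with N = 15 total subjects.

```python
def eta_squared(ss_bg, ss_total):
    """Equation 21.43: SS_BG / SS_T."""
    return ss_bg / ss_total

def adjusted_eta_squared(ms_wg, ss_total, n_total):
    """Equation 21.44: 1 - MS_WG / MS_T, where MS_T = SS_T / (N - 1)."""
    ms_total = ss_total / (n_total - 1)
    return 1 - ms_wg / ms_total

eta2 = eta_squared(ss_bg=26.53, ss_total=49.33)
adj_eta2 = adjusted_eta_squared(ms_wg=1.9, ss_total=49.33, n_total=15)
print(round(eta2, 2), round(adj_eta2, 2))
```

The adjusted value is smaller than the unadjusted one, which is the direction expected if the unadjusted statistic overestimates the population value.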


When $\hat{\omega}^2$ and $\hat{\eta}^2$ are based on a small sample size, their standard error (which is a measure
of error variability) will be large and, consequently, the reliability of the measures of treatment
effect will be relatively low. The latter will be reflected in the fact that under such conditions a
confidence interval computed for a measure of treatment effect (the computation of which will
not be described in this book) will have a wide range. It should be emphasized that a measure
of magnitude of treatment effect is a measure of association/correlation and, in and of itself, it
is not a test of significance. The significance of $\hat{\omega}^2$ and $\hat{\eta}^2$ is based on whether or not the
omnibus F value is significant.


Cohen's f index (Test 21i) Cohen (1977, 1988) describes an index of effect size that can be
employed with the single-factor between-subjects analysis of variance which he designates
as f. Cohen's f index is a generalization of his d effect size index in the case of three or more
means. The d index (Test 2a) was employed to compute the power of the single-sample t test
(Test 2), the t test for two independent samples, and the t test for two dependent samples
(Test 17). Kirk (1995, p. 181) notes that the value of f can be computed with either Equation
21.45 or Equation 21.46. When the latter equations are employed for Example 21.1, the value
f = .89 is obtained.

\[ f = \sqrt{\frac{\hat{\omega}^2}{1 - \hat{\omega}^2}} = \sqrt{\frac{.44}{1 - .44}} = .89 \]  (Equation 21.45)

\[ f = \sqrt{\frac{\frac{k - 1}{nk}\,(MS_{BG} - MS_{WG})}{MS_{WG}}} = \sqrt{\frac{\frac{3 - 1}{(5)(3)}\,(13.27 - 1.90)}{1.90}} = .89 \]  (Equation 21.46)

It should be noted that although Cohen (1977; 1988, p. 284) employs the notation for eta
squared in Equation 21.45 in place of omega squared, the definition of the statistic he uses is
consistent with the definition of omega squared. If for Example 21.1, one elects to employ the
values $\hat{\eta}^2 = .54$ or Adjusted $\hat{\eta}^2 = .46$ in Equation 21.45, the values f = 1.08 and f = .92 are
computed.
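
The f values reported above can be reproduced with the following Python sketch of Equations 21.45 and 21.46. The function names are hypothetical, and the inputs are the rounded Example 21.1 values used in the text ($MS_{BG} = 13.27$, $MS_{WG} = 1.90$, k = 3, n = 5).

```python
import math

def f_from_proportion(proportion):
    """Equation 21.45: sqrt(p / (1 - p)), where p is omega squared (or eta squared)."""
    return math.sqrt(proportion / (1 - proportion))

def f_from_mean_squares(ms_bg, ms_wg, k, n):
    """Equation 21.46 (equal n per group)."""
    return math.sqrt(((k - 1) / (n * k)) * (ms_bg - ms_wg) / ms_wg)

f_omega = f_from_proportion(0.44)                       # about .89
f_ms = f_from_mean_squares(13.27, 1.90, k=3, n=5)       # about .89
f_eta = f_from_proportion(0.54)                         # about 1.08
f_adj_eta = f_from_proportion(0.46)                     # about .92
print(round(f_omega, 2), round(f_ms, 2), round(f_eta, 2), round(f_adj_eta, 2))
```

Because the inputs are rounded to two decimal places, the two routes to f agree to two decimal places rather than exactly.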

Cohen (1977; 1988, pp. 284-288) has proposed the following (admittedly arbitrary) f values
as criteria for identifying the magnitude of an effect size: a) A small effect size is one that is
greater than .1 but not more than .25; b) A medium effect size is one that is greater than .25 but
not more than .4; and c) A large effect size is greater than .4. Employing Cohen's criteria, the
value f = .89 represents a large effect size. The f effect size index is commonly employed in
computing the power of the single-factor between-subjects analysis of variance (as well as the
power of the analysis of variance when it is employed with other designs). Cohen (1977; 1988,
Chapter 8) contains power tables for the analysis of variance that employ f as a measure of effect
size.

Equation 21.47 allows one to convert an f value into an $\hat{\omega}^2$ or eta squared value. (As was
the case with Equation 21.45, although Cohen (1977; 1988, p. 281) employs the notation for eta
squared in Equation 21.47 in place of omega squared, the definition of the statistic he uses is
consistent with the definition of omega squared.)

\[ \hat{\omega}^2 = \frac{f^2}{1 + f^2} = \frac{(.89)^2}{1 + (.89)^2} = .44 \]  (Equation 21.47)

If the values f = 1.08 and f = .92 are substituted in Equation 21.47, the values $\hat{\eta}^2 = .54$ and
Adjusted $\hat{\eta}^2 = .46$ are computed. The choice of employing omega squared or eta squared in
Equation 21.45 is up to the researcher's discretion. Let us assume that for Example 21.1,
f = .89. Cohen’s (1977, 1988) interpretation of an f value of .89 is that it represents a standard
deviation of the three population means that is .89 times as large as the standard deviation of the
observations within the populations. He notes that f will equal 0 when the k treatment means
are equal, and continue to increase as the ratio of between-groups variability to within-groups
variability gets larger.
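
A minimal Python sketch of the conversion in Equation 21.47 is shown below; the function name is hypothetical. Applying it to the three f values discussed in this section recovers the corresponding proportions to two decimal places.

```python
def proportion_from_f(f_value):
    """Equation 21.47: f^2 / (1 + f^2)."""
    return f_value ** 2 / (1 + f_value ** 2)

omega_back = proportion_from_f(0.89)    # about .44
eta_back = proportion_from_f(1.08)      # about .54
adj_eta_back = proportion_from_f(0.92)  # about .46
print(round(omega_back, 2), round(eta_back, 2), round(adj_eta_back, 2))
```

The function also returns 0 when f = 0, consistent with the observation that f equals 0 when the k treatment means are equal.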

Many sources recommend that in summarizing the results of an experiment, in addition to 
reporting the omnibus test statistic (e.g., an F or t value), a measure of magnitude of treatment 
effect also be included, since the latter can be useful in further clarifying the nature of the
relationship between the independent and dependent variables. It is important to note that if the
value of a measure of magnitude of treatment effect is small, it does not logically follow that the 
relationship between the independent and the dependent variables is trivial. There are instances 
when a small treatment effect may be of practical and/or theoretical value (an illustration of this 
is provided in Section IX (the Addendum) of the Pearson product-moment correlation 
coefficient, under the discussion of meta-analysis and related topics). It should be noted that 
when the independent variable is a nonmanipulated variable, it is possible that any treatment 
effect that is detected may be due to some variable other than the independent variable. Such 
studies (referred to as ex post facto studies) do not allow a researcher to adequately control for 
the potential effects of extraneous variables on the dependent variable. As a result of this, even 
if a large treatment effect is present, a researcher is not justified in drawing conclusions with 
regard to cause and effect — specifically, the researcher cannot conclude that the independent 
variable is responsible for group differences on the dependent variable. Other sources which 
discuss measures of magnitude of treatment effect for the single-factor between-subjects 
analysis of variance are Howell (1992, 1997), Keppel (1991), Kirk (1982, 1995), Maxwell and 
Delaney (1990), and Winer et al. (1991). Further discussion of the indices of treatment effect
discussed in this section, and the relationship between effect size and statistical power can be 
found in Section IX (the Addendum) of the Pearson product-moment correlation coefficient 
under the discussion of meta-analysis and related topics. 


6. Computation of a confidence interval for the mean of a treatment population Prior to
reading this section the reader may find it useful to review the discussion of confidence intervals
in Section VI of the single-sample t test. Equation 21.48 can be employed to compute a
confidence interval for the mean of a population represented by one of the k treatments for
which an omnibus F value has been computed.

\[ CI_{(1 - \alpha)} = \bar{X}_j \pm (t_{df_{WG}})\sqrt{\frac{MS_{WG}}{n}} \]  (Equation 21.48)

Where: $t_{df_{WG}}$ represents the tabled critical two-tailed value in the t distribution, for $df_{WG}$, below
which a proportion (percentage) equal to $[1 - (\alpha/2)]$ of the cases falls. If the
proportion (percentage) of the distribution that falls within the confidence interval is
subtracted from 1 (100%), it will equal the value of $\alpha$.


The use of the tabled critical t value for $df_{WG}$ in Equation 21.48 instead of a degrees of
freedom value based on the number of subjects who served in each group (i.e., $df = n - 1$) is
predicated on the fact that the estimated population variance is based on the pooled variance of
the k groups. If, however, there is reason to believe that the homogeneity of variance
assumption is violated, it is probably more prudent to employ Equation 2.7 to compute the
confidence interval. The latter equation employs the estimated population standard deviation for
a specific group for which a confidence interval is computed, rather than the pooled variability
(which is represented by $MS_{WG}$). If the standard deviation of the group is employed rather than
the pooled variability, $df = n - 1$ is used for the analysis. It can also be argued that it is
preferable to employ Equation 2.7 in computing the confidence interval for a population mean 
in circumstances where a researcher has reason to believe that the group in question represents 
a population that is distinct from the populations represented by the other (k — 1) groups. 
Specifically, if the mean value of a group is significantly above or below the means of the other 
groups, it can be argued that one can no longer assume that the group shares a common variance 
with the other groups. In view of this, one can take the position that the variance of the group 
is the best estimate of the variance of the population represented by that group (as opposed to the 
pooled variance of all of the groups involved in the study). The point to be made here is that, as 
is the case in computing a confidence interval for a comparison, depending upon how one 
conceptualizes the data, more than one methodology can be employed for computing a 
confidence interval for a population mean. 

The computation of the 95% confidence interval for the mean of the population represented
by Group 1 is illustrated below employing Equation 21.48. The value $t_{.05} = 2.18$ is the tabled
critical two-tailed $t_{.05}$ value for $df_{WG} = 12$.

\[ CI_{.95} = 9.2 \pm (2.18)\sqrt{\frac{1.9}{5}} = 9.2 \pm 1.34 \]

Thus, the researcher can be 95% confident (or the probability is .95) that the mean of the
population represented by Group 1 falls within the range 7.86 to 10.54. Stated symbolically:
$7.86 \le \mu_1 \le 10.54$.

If, on the other hand, the researcher elects to employ Equation 2.7 to compute the
confidence interval for the mean of the population represented by Group 1, the standard deviation
of Group 1 is employed in lieu of the pooled variability of all the groups. In addition, the tabled
critical two-tailed $t_{.05}$ value for $df = n - 1 = 4$ is employed in Equation 2.7. The computations
are shown below.

\[ CI_{.95} = \bar{X}_1 \pm (t_{.05})\frac{\tilde{s}_1}{\sqrt{n}} = 9.2 \pm (2.78)\frac{.84}{\sqrt{5}} = 9.2 \pm 1.04 \]

Thus, employing Equation 2.7, the range for $CI_{.95}$ is $8.16 \le \mu_1 \le 10.24$. Note that
although the range of values computed with Equation 21.48 and Equation 2.7 are reasonably
close to one another, the range of the confidence interval computed with Equation 21.48 is wider.


Depending on the values of $t_{df_{WG}}$ versus $t_{df = n - 1}$ and $\sqrt{MS_{WG}/n}$ versus $s_{\bar{X}}$, there will usually be
a discrepancy between the confidence interval computed with the two equations. When the
estimated population variances of all k groups are equal (or reasonably close to one another),
Equation 2.7 will yield a wider confidence interval than Equation 21.48, since $\sqrt{MS_{WG}/n}$ will
equal $s_{\bar{X}}$, and $t_{df_{WG}} < t_{df = n - 1}$. In the example illustrated in this section, the reason the use of
Equation 2.7 yields a smaller confidence interval than Equation 21.48 (in spite of the fact that
a larger t value is employed in Equation 2.7) is because $\tilde{s}_1^2 = .7$ (the estimated variance of the
population represented by Group 1) is substantially less than $MS_{WG} = 1.9$ (the pooled estimate
of within-groups variability).
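
The comparison between the two intervals can be made concrete with a short Python sketch. The helper function is hypothetical. The critical value 2.18 is the one quoted above for $df_{WG} = 12$; the value 2.78 is the standard tabled two-tailed .05 value of t for df = 4, which is supplied here as an assumption about what Table A2 would show.

```python
import math

def confidence_interval(mean, t_critical, standard_error):
    """Return (lower, upper) limits: mean -/+ t_critical * standard_error."""
    margin = t_critical * standard_error
    return mean - margin, mean + margin

n = 5
group_1_mean = 9.2

# Equation 21.48: pooled error term MS_WG = 1.9, df_WG = 12.
pooled = confidence_interval(group_1_mean, 2.18, math.sqrt(1.9 / n))
# Equation 2.7: Group 1 variance only (.7), df = n - 1 = 4.
single = confidence_interval(group_1_mean, 2.78, math.sqrt(0.7 / n))

pooled_width = pooled[1] - pooled[0]
single_width = single[1] - single[0]
print([round(x, 2) for x in pooled], [round(x, 2) for x in single])
```

Here the pooled interval is wider even though its t value is smaller, because the Group 1 variance is much lower than the pooled within-groups estimate.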


VII. Additional Discussion of the Single-Factor Between-Subjects 
Analysis of Variance 


1. Theoretical rationale underlying the single-factor between-subjects analysis of variance
In the single-factor between-subjects analysis of variance it is assumed that any variability
between the means of the k groups can be attributed to one or both of the following two elements:
a) Experimental error; and b) The experimental treatments. When $MS_{BG}$ (the value
computed for between-groups variability) is significantly greater than $MS_{WG}$ (the value computed for
within-groups variability), it is interpreted as indicating that a substantial portion of
between-groups variability is due to a treatment effect. The rationale for this is as follows.

Experimental error is random variability in the data that is beyond the control of the
researcher. In an independent groups design the average amount of variability within each of the
k groups is employed to represent experimental error. Thus, the value computed for $MS_{WG}$ is
the normal amount of variability that is expected between the scores of different subjects who
serve in the same group. Within this framework, within-groups variability is employed as a
baseline to represent variability which results from factors that are beyond an experimenter's control.
The experimenter assumes that since such uncontrollable factors are responsible for
within-groups differences, it is logical to assume that they can produce differences of a comparable
magnitude between the means of the k groups. As long as the variability between the group
means ($MS_{BG}$) is approximately the same as within-groups variability ($MS_{WG}$), the experimenter
can attribute any between-groups variability to experimental error. When, however,
between-groups variability is substantially greater than within-groups variability, it indicates that
something over and above error variability is contributing to the variability between the k group
means. In such a case, it is assumed that a treatment effect is responsible for the larger value of $MS_{BG}$
relative to the value of $MS_{WG}$. In essence, if within-groups variability is subtracted from
between-groups variability, any remaining variability can be attributed to a treatment effect. If
there is no treatment effect, the result of the subtraction will be zero. Of course, one can never
completely rule out the possibility that if $MS_{BG}$ is larger than $MS_{WG}$, the larger value for $MS_{BG}$
is entirely due to error variability. However, since the latter is unlikely, when $MS_{BG}$ is
significantly larger than $MS_{WG}$, it is interpreted as indicating the presence of a treatment effect.
Tables 21.7 and 21.8 will be used to illustrate the relationship between between-groups
variability and within-groups variability in the analysis of variance. Assume that both tables
contain data for two hypothetical studies employing an independent groups design involving k = 3
groups and n = 3 subjects per group. In the hypothetical examples below, even though it is not
employed as the measure of variability in the analysis of variance, the range (the difference
between the lowest and highest score) will be used as a measure of variability. The range is
employed since: a) It is simpler to employ for purposes of illustration than the variance; and b)
What is derived from the range in these examples can be generalized to the variance/mean
squares as they are employed within the framework of the analysis of variance.

Table 21.7 presents a set of data where there is no treatment effect, since between-groups 
variability equals within-groups variability. In Table 21.7, in order to assess within-groups 
variability we do the following: a) Compute the range of each group. This is done by subtracting 
the lowest score in each group from the highest score in that group; and b) Compute the average 
range of the three groups. Since the range for all three groups equals 2, the average range equals
2 (i.e., (2 + 2 + 2)/3 = 2). This value will be used to represent within-groups variability (which
is a function of individual differences in performance among members of the same group). 


Table 21.7 Data Illustrating No Treatment Effect

    Group 1      Group 2      Group 3
       3            4            5
       4            5            6
       5            6            7
    X̄_1 = 4     X̄_2 = 5     X̄_3 = 6


Table 21.8 Data Illustrating Treatment Effect

    Group 1      Group 2      Group 3
       3           10           30
       4           11           31
       5           12           32
    X̄_1 = 4     X̄_2 = 11    X̄_3 = 31


In Table 21.7, between-groups variability is obtained by subtracting the lowest group mean
(X̄_1 = 4) from the highest group mean (X̄_3 = 6). Thus, since X̄_3 - X̄_1 = 6 - 4 = 2,
between-groups variability equals 2. As noted previously, if between-groups variability is
significantly larger than within-groups variability, it is interpreted as indicating the presence
of a treatment effect. Since, in the example under discussion, both between-groups and
within-groups variability equal 2, there is no evidence of a treatment effect. To put it another
way, if within-groups variability is subtracted from between-groups variability, the difference
is zero. In order for there to be a treatment effect, a positive difference should be present.

If, for purposes of illustration, we employ the range in the F ratio to represent variability, 
the value of the F will equal 1 since: 


$$F = \frac{\text{Between-groups variability}}{\text{Within-groups variability}} = \frac{2}{2} = 1$$





As noted previously, when F = 1, there is no evidence of a treatment effect, and thus the
null hypothesis is retained.

Table 21.8 presents a set of data where a treatment effect is present as a result of
between-groups variability being greater than within-groups variability. In Table 21.8, if we
once again use the range to assess variability, within-groups variability equals 2. This is the
case since the range of each group equals 2, yielding an average range of 2. The between-groups
variability, on the other hand, equals 27. The latter value is obtained by subtracting the
smallest of the three group means (X̄_1 = 4) from the largest of the group means (X̄_3 = 31).
Thus, X̄_3 - X̄_1 = 31 - 4 = 27. Because the value 27 is substantially larger than 2 (which is a
baseline measure of error variability that will be tolerated between the group means), there is
strong evidence that a treatment effect is present. Specifically, since we will assume that only
2 of the 27 units that comprise between-groups variability can be attributed to experimental
error, the remaining 27 - 2 = 25 units can be assumed to represent the contribution of the
treatment effect.

If once again, for purposes of illustration, the values of the range are employed to compute 
the F ratio, the resulting value will be F = 27/2 = 13.5. As noted previously, when the value of 
F is substantially greater than 1, it is interpreted as indicating the presence of a treatment effect, 
and consequently the null hypothesis is rejected. 
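
The range-based illustration for Tables 21.7 and 21.8 can be reproduced with a few lines of code. The sketch below is for illustration only: it uses the range in place of the mean squares, exactly as the discussion above does, and is not the computation employed in an actual analysis of variance. The function names are hypothetical.

```python
def score_range(scores):
    """Range of a set of scores: highest score minus lowest score."""
    return max(scores) - min(scores)


def range_based_ratio(groups):
    """Return (between, within, ratio) using ranges as the measure of variability."""
    # Within-groups variability: average of the k group ranges.
    within = sum(score_range(g) for g in groups) / len(groups)
    # Between-groups variability: largest group mean minus smallest group mean.
    means = [sum(g) / len(g) for g in groups]
    between = max(means) - min(means)
    return between, within, between / within


table_21_7 = [[3, 4, 5], [4, 5, 6], [5, 6, 7]]
table_21_8 = [[3, 4, 5], [10, 11, 12], [30, 31, 32]]

print(range_based_ratio(table_21_7))   # (2.0, 2.0, 1.0)
print(range_based_ratio(table_21_8))   # (27.0, 2.0, 13.5)
```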


2. Definitional equations for the single-factor between-subjects analysis of variance In the 
description of the computational protocol for the single-factor between-subjects analysis of 
variance in Section IV, Equations 21.2, 21.3, and 21.5 are employed to compute the values
SS_T, SS_BG, and SS_WG. The latter set of computational equations was employed, since it
allows for the most efficient computation of the sum of squares values. As noted in Section IV,
computational equations are derived from definitional equations which reveal the underlying 
logic involved in the derivation of the sums of squares. This section will describe the definitional 
equations for the single-factor between-subjects analysis of variance, and apply them to 
Example 21.1 in order to demonstrate that they yield the same values as the computational 
equations. 

As noted previously, the total sum of squares (SS_T) is made up of two elements, the
between-groups sum of squares (SS_BG) and the within-groups sum of squares (SS_WG). The
contribution of any single subject's score to the total variability in the data can be expressed in
terms of a between-groups component and a within-groups component. When the
between-groups component and the within-groups component are added, the sum reflects that
subject's total contribution to the overall variability in the data. The contribution of all N subjects
to the total variability (SS_T) and the elements that comprise it (SS_BG and SS_WG) are summarized
in Table 21.9. The definitional equations described in this section employ the following notation:
X_ij represents the score of the i-th subject in the j-th group; X̄_T represents the grand mean (which
is X̄_T = ΣX_T/N = 110/15 = 7.33); and X̄_j represents the mean of the j-th group.

Equation 21.49 is the definitional equation for the total sum of squares.


$$SS_T = \sum_{j=1}^{k} \sum_{i=1}^{n} (X_{ij} - \bar{X}_T)^2$$   (Equation 21.49)


In employing Equation 21.49 to compute SS_T, the grand mean (X̄_T) is subtracted from each
of the N scores and each of the N difference scores is squared. The total sum of squares (SS_T)
is the sum of the N squared difference scores. Equation 21.49 is computationally equivalent to
Equation 21.2.

Equation 21.50 is the definitional equation for the between-groups sum of squares. 


$$SS_{BG} = n \sum_{j=1}^{k} (\bar{X}_j - \bar{X}_T)^2$$   (Equation 21.50)


In employing Equation 21.50 to compute SS_BG, the following operations are carried out for
each group. The grand mean (X̄_T) is subtracted from the group mean (X̄_j). The difference score
is squared, and the squared difference score is multiplied by the number of subjects in the group
(n). After this is done for all k groups, the values that have been obtained for each group as a
result of multiplying the squared difference score by the number of subjects in a group are summed.
The resulting value represents the between-groups sum of squares (SS_BG). Equation 21.50 is
computationally equivalent to Equation 21.3. An alternative but equivalent method of obtaining
SS_BG (which is employed in deriving SS_BG in Table 21.9) is as follows: Within each group, for
each of the n subjects the grand mean is subtracted from the group mean, each difference score is
squared, and upon doing this for all k groups, the N squared difference scores are summed.

Equation 21.51 is the definitional equation for the within-groups sum of squares.


$$SS_{WG} = \sum_{j=1}^{k} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2$$   (Equation 21.51)


Table 21.9 Computation of Sums of Squares for Example 21.1 with Definitional Equations

           X_ij   SS_WG = ΣΣ(X_ij - X̄_j)²    SS_BG = nΣ(X̄_j - X̄_T)²    SS_T = ΣΣ(X_ij - X̄_T)²

Group 1      8    (8 - 9.2)²  = 1.44          (9.2 - 7.33)² = 3.497      (8 - 7.33)²  =   .449
            10    (10 - 9.2)² =  .64          (9.2 - 7.33)² = 3.497      (10 - 7.33)² =  7.129
             9    (9 - 9.2)²  =  .04          (9.2 - 7.33)² = 3.497      (9 - 7.33)²  =  2.789
            10    (10 - 9.2)² =  .64          (9.2 - 7.33)² = 3.497      (10 - 7.33)² =  7.129
             9    (9 - 9.2)²  =  .04          (9.2 - 7.33)² = 3.497      (9 - 7.33)²  =  2.789

Group 2      7    (7 - 6.6)²  =  .16          (6.6 - 7.33)² =  .533      (7 - 7.33)²  =   .109
             8    (8 - 6.6)²  = 1.96          (6.6 - 7.33)² =  .533      (8 - 7.33)²  =   .449
             5    (5 - 6.6)²  = 2.56          (6.6 - 7.33)² =  .533      (5 - 7.33)²  =  5.429
             8    (8 - 6.6)²  = 1.96          (6.6 - 7.33)² =  .533      (8 - 7.33)²  =   .449
             5    (5 - 6.6)²  = 2.56          (6.6 - 7.33)² =  .533      (5 - 7.33)²  =  5.429

Group 3      4    (4 - 6.2)²  = 4.84          (6.2 - 7.33)² = 1.277      (4 - 7.33)²  = 11.089
             8    (8 - 6.2)²  = 3.24          (6.2 - 7.33)² = 1.277      (8 - 7.33)²  =   .449
             7    (7 - 6.2)²  =  .64          (6.2 - 7.33)² = 1.277      (7 - 7.33)²  =   .109
             5    (5 - 6.2)²  = 1.44          (6.2 - 7.33)² = 1.277      (5 - 7.33)²  =  5.429
             7    (7 - 6.2)²  =  .64          (6.2 - 7.33)² = 1.277      (7 - 7.33)²  =   .109

                  SS_WG = 22.80               SS_BG = 26.535             SS_T = 49.335


In employing Equation 21.51 to compute SS_WG, the following operations are carried out
for each group. The group mean (X̄_j) is subtracted from each score in the group. The difference
scores are squared, after which the sum of the squared difference scores is obtained. The sum 
of the sum of the squared difference scores for all k groups represents the within-groups sum of 
squares. Equation 21.51 is computationally equivalent to Equation 21.5. 

Table 21.9 illustrates the use of Equations 21.49, 21.50, and 21.51 with the data for 
Example 21.1. The resulting values of SS_T, SS_BG, and SS_WG are identical to those obtained
with the computational equations (Equations 21.2, 21.3, and 21.5). Any minimal discrepancies 
are the result of rounding off error. 
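
The equivalence of the definitional and computational approaches can be checked numerically with the scores of Example 21.1. In the sketch below, definitional_ss follows Equations 21.49, 21.50, and 21.51 as described above. The function computational_ss assumes that Equations 21.2, 21.3, and 21.5 have the usual raw-score forms (sums of squared scores and squared sums); that assumption, along with both function names, is introduced here for illustration.

```python
groups = [[8, 10, 9, 10, 9], [7, 8, 5, 8, 5], [4, 8, 7, 5, 7]]


def mean(xs):
    return sum(xs) / len(xs)


def definitional_ss(groups):
    """SS_T, SS_BG, SS_WG from squared deviations (Equations 21.49-21.51)."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    ss_t = sum((x - grand_mean) ** 2 for x in scores)
    ss_bg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_wg = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return ss_t, ss_bg, ss_wg


def computational_ss(groups):
    """SS_T, SS_BG, SS_WG from raw-score sums (assumed form of Eqs. 21.2, 21.3, 21.5)."""
    scores = [x for g in groups for x in g]
    correction = sum(scores) ** 2 / len(scores)   # (sum of all scores)^2 / N
    ss_t = sum(x ** 2 for x in scores) - correction
    ss_bg = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ss_wg = sum(sum(x ** 2 for x in g) - sum(g) ** 2 / len(g) for g in groups)
    return ss_t, ss_bg, ss_wg


ss_def = definitional_ss(groups)
ss_comp = computational_ss(groups)
print([round(v, 3) for v in ss_def])
print([round(v, 3) for v in ss_comp])
```

Both functions work with an unrounded grand mean, so they return SS_BG and SS_T values that differ from the entries of Table 21.9 in the second or third decimal place; the table rounds the grand mean to 7.33 before squaring.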


3. Equivalency of the single-factor between-subjects analysis of variance and the t test for
two independent samples when k = 2 Interval/ratio data for an experiment involving k = 2
independent groups can be evaluated with either a single-factor between-subjects analysis of
variance or a t test for two independent samples. When both of the aforementioned tests are
employed to evaluate the same set of data, they will yield the same result. Specifically, the
following will always be true with respect to the relationship between the computed F and t values
for the same set of data: F = t² and |t| = √F. It will also be the case that the square of the tabled
critical t value at a prespecified level of significance for df = n_1 + n_2 - 2 will be equal to the
tabled critical F value at the same level of significance for df_num = 1 and df_den (which will be
df_WG = N - k = N - 2, which is equivalent to the value df = n_1 + n_2 - 2 employed for the
t test for two independent samples).

To illustrate the equivalency of the results obtained with the single-factor between-
subjects analysis of variance and the t test for two independent samples when k = 2, an F
value will be computed for Example 11.1. The value t = -1.96 (t = -1.964 if carried out to 3
decimal places) is obtained for the latter example when the t test for two independent samples
is employed. When the same set of data is evaluated with the single-factor between-subjects
analysis of variance, the value F = 3.86 is computed. Note that (t = -1.964)² = (F = 3.86).
Equations 21.2, 21.3, and 21.4 are employed below to compute the values SS_T, SS_BG, and SS_WG
for Example 11.1. Since k = 2, n = 5, and nk = N = 10, df_BG = k - 1 = 2 - 1 = 1, df_WG = N -
k = 10 - 2 = 8, and df_T = N - 1 = 10 - 1 = 9. The full analysis of variance is summarized in
Table 21.10.








$$SS_T = 473 - \frac{(53)^2}{10} = 192.1$$

$$SS_{BG} = \frac{(14)^2}{5} + \frac{(39)^2}{5} - \frac{(53)^2}{10} = 62.5$$

$$SS_{WG} = SS_T - SS_{BG} = 192.1 - 62.5 = 129.6$$


Table 21.10 Summary Table of Analysis of Variance for Example 11.1

    Source of variation       SS      df      MS       F
    Between-groups           62.5      1     62.5     3.86
    Within-groups           129.6      8     16.2
    Total                   192.1      9

For df_BG = 1 and df_WG = 8, the tabled critical .05 and .01 values are F_.05 = 5.32 and
F_.01 = 11.26 (which are appropriate for a nondirectional analysis). Note that (if one takes into
account rounding off error) the square roots of the aforementioned tabled critical values are (for
df = 8) the tabled critical two-tailed values t_.05 = 2.31 and t_.01 = 3.36 that are employed in
Example 11.1 to evaluate the value t = -1.96. Since the obtained value F = 3.86 is less than
the tabled critical .05 value F_.05 = 5.32, the nondirectional alternative hypothesis H_1: μ_1 ≠ μ_2
is not supported. The directional alternative hypothesis H_1: μ_1 < μ_2 is supported at the .05
level, since F = 3.86 is greater than the tabled critical one-tailed .05 value F = 3.46 (the
square root of which is the tabled critical one-tailed .05 value t_.05 = 1.86 employed in Example
11.1). The directional alternative hypothesis H_1: μ_1 < μ_2 is not supported at the .01 level, since
F = 3.86 is less than the tabled critical one-tailed .01 value F = 8.41 (the square root of which
is the tabled critical one-tailed .01 value t_.01 = 2.90 employed in Example 11.1). The
conclusions derived from the single-factor between-subjects analysis of variance are identical to
those reached when the data are evaluated with the t test for two independent samples.
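
The relationship F = t² can be verified directly. The scores below are an assumption: the raw data for Example 11.1 are not listed in this section, so the sketch employs a set of ten scores that reproduces every summary value used above (group sums of 14 and 39, a total of 53, and a sum of squared scores of 473). The helper names are hypothetical.

```python
import math

# Assumed scores for Example 11.1 (consistent with the sums quoted in the text).
group_1 = [11, 1, 0, 2, 0]
group_2 = [11, 11, 5, 8, 4]


def mean(xs):
    return sum(xs) / len(xs)


def sum_sq_dev(xs):
    """Sum of squared deviations of the scores from their own mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


n1, n2 = len(group_1), len(group_2)

# t test for two independent samples with a pooled variance estimate.
pooled_var = (sum_sq_dev(group_1) + sum_sq_dev(group_2)) / (n1 + n2 - 2)
t = (mean(group_1) - mean(group_2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Single-factor between-subjects analysis of variance with k = 2.
grand_mean = mean(group_1 + group_2)
ss_bg = n1 * (mean(group_1) - grand_mean) ** 2 + n2 * (mean(group_2) - grand_mean) ** 2
ss_wg = sum_sq_dev(group_1) + sum_sq_dev(group_2)
f = (ss_bg / 1) / (ss_wg / (n1 + n2 - 2))   # MS_BG / MS_WG

print(round(t, 3), round(t ** 2, 2), round(f, 2))
```

With these scores the pooled variance estimate of the t test equals MS_WG = 16.2, which is why the two procedures are algebraically tied to one another when k = 2.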


4. Robustness of the single-factor between-subjects analysis of variance The general com- 
ments made with respect to the robustness of the t test for two independent samples (in Section 
VII of the latter test) are applicable to the single-factor between-subjects analysis of variance. 
Most sources state that the single-factor between-subjects analysis of variance is robust with 
respect to violation of its assumptions. Nevertheless, when the normality and/or homogeneity
of variance assumption is violated, it is recommended that a more conservative analysis be
conducted (i.e., employ the tabled F_.01 value or even a tabled value for a lower alpha level to
represent the F_.05 value). When the violation of one or both assumptions is extreme, some
sources recommend that a researcher consider employing an alternative procedure. Alternative 
procedures are discussed earlier in this section under the homogeneity of variance assumption. 
Keppel (1991) and Maxwell and Delaney (1990) provide comprehensive discussions on the 
general subject of the robustness of the single-factor between-subjects analysis of variance. 


5. Fixed-effects versus random-effects models for the single-factor between-subjects 
analysis of variance The terms fixed- versus random-effects models refer to the way in which 
a researcher selects the levels of the independent variables that are employed in an experiment. 
Whereas a fixed-effects model assumes that the levels of the independent variable are the same 
levels that will be employed in any attempted replication of the experiment, a random-effects 
model assumes that the levels have been randomly selected from the overall population of all 
possible levels that can be employed for the independent variable. The discussion of the single- 
factor between-subjects analysis of variance in this book assumes a fixed-effects model. With 
the exception of the computation of measures of magnitude of treatment effect, the equations 
employed for the fixed-effects model are identical to those that are employed for a random- 
effects model. However, in the case of more complex designs (within-subjects and factorial
designs), the computational procedures for fixed- versus random-effects models may differ.
The degree to which a researcher may generalize the results of an experiment will be a direct 
function of which model is employed. Specifically, if a fixed-effects model is employed, one can 
only generalize the results of an experiment to the specific levels of the independent variable that 
are used in the experiment. On the other hand, if a random-effects model is employed, one can 
generalize the results to all possible levels of the independent variable. 


6. Multivariate analysis of variance (MANOVA) The multivariate analog of the single-
factor between-subjects analysis of variance is the multivariate analysis of variance
(MANOVA), which is one of a number of multivariate statistical procedures discussed in the book.
As noted under the discussion of Hotelling's T² (discussed in Section VII of the t test for two
independent samples), the term multivariate is employed in reference to procedures that
evaluate experimental designs in which there are multiple independent variables and/or multiple
dependent variables. The MANOVA is a generalization of Hotelling's T² to experimental
designs involving more than two groups. In point of fact, Hotelling's T² represents a special
case of the multivariate analysis of variance (MANOVA). The MANOVA can be employed
to analyze the data for an experiment that involves a single independent variable comprised of 
two or more levels and multiple dependent variables. With regard to the latter, instead of a single 
score, each subject produces scores on two or more dependent variables. To illustrate, let us 
assume that in Example 21.1 two scores are obtained for each subject. One score represents the 
number of correct responses, and a second score represents response latency (i.e., speed of 
response). Within the framework of the MANOVA, a composite mean based on both the 
number of correct responses and response latency scores is computed for each group. The latter 
composite means are referred to as mean vectors or centroids. As is the case with the single- 
factor between-subjects analysis of variance, the means (in this case, composite) for the three 
groups are then compared with one another. 

Stevens (1996) notes the following reasons why a researcher should consider employing 
a MANOVA instead of an analysis of variance procedure which evaluates just a single dependent 
variable: a) Most treatments that have an effect on subjects will impact them in a variety of ways. 
Consequently, an experimental design that evaluates multiple effects accruing from a treatment 
will provide a more comprehensive picture of the overall impact of the treatment; b) As noted in 
Section VI, conducting multiple statistical tests can inflate the overall Type I error rate associated
with a body of data. By employing multiple dependent variables within the framework of a
single study, a researcher can avoid the inflated Type I error rate that can result from conducting 
separate studies, with each one requiring an analysis of a single dependent variable; c) Univariate 
analysis of variance does not take into consideration that two or more potential dependent vari- 
ables may be correlated with one another, whereas a MANOVA takes such intercorrelations into 
account in computing the MANOVA test statistic; and d) The effect of one or more dependent 
variables by themselves may not be strong enough to yield a significant result, yet when 
employed together within the framework of a design that is evaluated with a MANOVA, their 
combined effect may be significant. 

Like most multivariate procedures, the mathematics involved in conducting MANOVA is 
quite complex, and for this reason it becomes laborious if not impractical to implement without 
the aid of a computer. Since a full description of MANOVA is beyond the scope of this book, 
the interested reader should consult sources such as Stevens (1986, 1996) and Tabachnick and 
Fidell (1989, 1996) which describe multivariate procedures in detail. 


VIII. Additional Examples Illustrating the Use of the Single-Factor 
Between-Subjects Analysis of Variance 


Since the single-factor between-subjects analysis of variance can be employed to evaluate 
interval/ratio data for any independent groups design involving two or more groups, it can be 
used to evaluate any of the examples that are evaluated with the t test for two independent
samples (with the exception of Example 11.3). Examples 21.2, 21.3, and 21.4 in this section are 
extensions of Examples 11.1, 11.4, and 11.5 to a design involving k = 3 groups. Since the data 
for all of the examples are identical to the data employed in Example 21.1, they yield the same 
result. 


Example 21.2 In order to assess the efficacy of a new antidepressant drug, 15 clinically de- 
pressed patients are randomly assigned to one of three groups. Five patients are assigned to 
Group 1, which is administered the antidepressant drug for a period of six months. Five patients 
are assigned to Group 2, which is administered a placebo during the same six-month period. 
Five patients are assigned to Group 3, which does not receive any treatment during the six-month 
period. Assume that prior to introducing the experimental treatments, the experimenter con- 
firmed that the level of depression in the three groups was equal. After six months elapse, all 15 
subjects are rated by a psychiatrist (who is blind with respect to a subject's experimental 
condition) on their level of depression. The psychiatrist's depression ratings for the five subjects 
in each group follow. (The higher the rating, the less depressed a subject.) Group 1: 8, 10, 9, 
10, 9; Group 2: 7, 8, 5, 8, 5; Group 3: 4, 8, 7, 5, 7. Do the data indicate that the 
antidepressant drug is effective? 


Example 21.3 A researcher wants to assess the relative effect of three different kinds of 
punishment on the emotionality of mice. Each of 15 mice is randomly assigned to one of three 
groups. During the course of the experiment each mouse is sequestered in an experimental 
chamber. While in the chamber, each of the five mice in Group 1 is periodically presented with 
a loud noise, each of the five mice in Group 2 is periodically presented with a blast of cold air, 
and each of the mice in Group 3 is periodically presented with an electric shock. The 
presentation of the punitive stimulus for each of the animals is generated by a machine that
randomly presents the stimulus throughout the duration of the time the animal is in the chamber.
The dependent variable of emotionality employed in the study is the number of times each mouse
defecates while in the experimental chamber. The number of episodes of defecation for the 15
mice follows: Group 1: 8, 10, 9, 10, 9; Group 2: 7, 8, 5, 8, 5; Group 3: 4, 8, 7, 5, 7. Do
subjects exhibit differences in emotionality under the different experimental conditions?


Example 21.4 Each of three companies that manufacture the same size precision ball bearing
claims it has better quality control than its competitors. A quality control engineer conducts a
study in which he compares the precision of ball bearings manufactured by the three companies.
The engineer randomly selects five ball bearings from the stock of Company A, five ball bearings
from the stock of Company B, and five ball bearings from the stock of Company C. He measures
how much the diameter of each of the 15 ball bearings deviates from the manufacturer's
specifications. The deviation scores (in micrometers) for the 15 ball bearings manufactured by the
three companies follow: Company A: 8, 10, 9, 10, 9; Company B: 7, 8, 5, 8, 5; Company C:
4, 8, 7, 5, 7. What can the engineer conclude about the relative quality control of the three
companies?
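
Because Examples 21.2, 21.3, and 21.4 employ the same 15 scores, a single computation covers all three. The following sketch computes the summary values of the analysis of variance for those scores from the definitional sums of squares; the function name one_way_anova and the dictionary keys are hypothetical, and the obtained F value still has to be evaluated against the appropriate tabled critical value.

```python
def one_way_anova(groups):
    """Summary values for a single-factor between-subjects analysis of variance."""
    scores = [x for g in groups for x in g]
    n_total, k = len(scores), len(groups)
    grand_mean = sum(scores) / n_total

    # Between-groups and within-groups sums of squares (definitional form).
    ss_bg = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_wg = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    df_bg, df_wg = k - 1, n_total - k
    ms_bg, ms_wg = ss_bg / df_bg, ss_wg / df_wg
    return {"SS_BG": ss_bg, "SS_WG": ss_wg, "df_BG": df_bg, "df_WG": df_wg,
            "MS_BG": ms_bg, "MS_WG": ms_wg, "F": ms_bg / ms_wg}


# Scores shared by Examples 21.2, 21.3, and 21.4.
shared_scores = [[8, 10, 9, 10, 9], [7, 8, 5, 8, 5], [4, 8, 7, 5, 7]]
result = one_way_anova(shared_scores)
print({name: round(value, 2) for name, value in result.items()})
```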


IX. Addendum 


Test 21j: The single-factor between-subjects analysis of covariance Analysis of covariance
(for which the acronym ANCOVA is commonly employed) is an analysis of variance procedure
that employs a statistical adjustment (involving regression analysis, which is discussed under the
Pearson product-moment correlation coefficient), to control for the effect of one or more
extraneous variables on a dependent variable. Although it is possible to employ multiple
extraneous variables within the framework of an analysis of covariance, in this section the single- 
factor between-subjects analysis of covariance involving one extraneous variable will be dis- 
cussed. Analysis of covariance is an analysis of variance procedure which utilizes data on an 
extraneous variable that has a linear correlation with the dependent variable. Such an extraneous 
variable is referred to as a covariate or concomitant variable. By utilizing the correlation be- 
tween the covariate and the dependent variable, the researcher is able to remove variability on 
the dependent variable that is attributable to the covariate. The effect of the latter is a reduction 
in the error variability employed in computing the F ratio, thereby resulting in a more powerful 
test of the alternative hypothesis. A second potential effect of an analysis of covariance is that 
by utilizing the correlation between scores on a covariate and the dependent variable, the mean 
scores of the different groups can be adjusted for any pre-existing differences on the dependent 
variable which are present prior to the administration of the experimental treatments. Thus, one 
component of the analysis of covariance computations involves computing adjusted mean values 
for each of the k treatment means. 
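
The idea of an adjusted mean can be sketched in code. The sketch assumes the standard adjustment, in which each group mean on the dependent variable is corrected by the pooled within-groups regression slope multiplied by the distance of the group's covariate mean from the grand covariate mean; it is not necessarily organized in the same way as the computational equations employed for this test. The data are hypothetical and were constructed so that every score on the dependent variable equals the covariate score plus 10. The function name adjusted_means is also hypothetical.

```python
def adjusted_means(y_groups, x_groups):
    """Return (pooled within-groups slope, list of adjusted group means on Y)."""
    all_x = [x for g in x_groups for x in g]
    grand_x = sum(all_x) / len(all_x)

    sp_wg = 0.0      # pooled within-groups sum of products of X and Y deviations
    ss_wg_x = 0.0    # pooled within-groups sum of squares on the covariate X
    group_means = []
    for ys, xs in zip(y_groups, x_groups):
        mean_y, mean_x = sum(ys) / len(ys), sum(xs) / len(xs)
        sp_wg += sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        ss_wg_x += sum((x - mean_x) ** 2 for x in xs)
        group_means.append((mean_y, mean_x))

    slope = sp_wg / ss_wg_x
    # Adjusted mean: group mean on Y minus slope * (group covariate mean - grand covariate mean).
    adjusted = [mean_y - slope * (mean_x - grand_x) for mean_y, mean_x in group_means]
    return slope, adjusted


# Hypothetical data: covariate scores (X) and dependent variable scores (Y = X + 10).
x_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
y_groups = [[11, 12, 13], [12, 13, 14], [13, 14, 15]]

slope, adjusted = adjusted_means(y_groups, x_groups)
print(slope, adjusted)
```

In this constructed case the unadjusted group means on the dependent variable are 12, 13, and 14, yet all three adjusted means are equal, since the differences between the groups are entirely attributable to their differences on the covariate.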

When analysis of covariance is employed, the most commonly used covariates are pretest 
scores on the dependent variable or subject variables such as intelligence, anxiety, weight, etc. 
Thus, in Example 21.1, if a researcher believes that the number of nonsense syllables a subject 
learns is a function of verbal intelligence, and if there is a linear correlation between verbal 
intelligence and one’s ability to learn nonsense syllables, verbal intelligence can be employed as 
a covariate. In order to employ it as a covariate, it is necessary to have a verbal intelligence score 
for each subject who participates in the experiment. By employing the latter scores the researcher
can use an analysis of covariance to determine whether there are performance differences on the 
dependent variable between the three groups that are independent of verbal intelligence. Later 
in this section an analysis of covariance will be employed to evaluate the data for Example 21.1 
using verbal intelligence as a covariate. 

The optimal design for which an analysis of covariance can be employed is an experiment 
involving a manipulated independent variable in which subjects are randomly assigned to groups, 
and in which scores on the covariate are measured prior to the introduction of the experimental
treatments. In such a design the analysis of covariance is able to remove variability on the
dependent variable that is attributed to differences between the groups on the covariate. However,
when, prior to introducing the experimental treatments, it is known that a strong correlation exists
between the covariate and the dependent variable, and that the groups are not equal with respect 
to the covariate, the following two options are available to a researcher: a) Subjects can be 
randomly reassigned to groups, after which the researcher can check that the resulting groups are 
equivalent with respect to the covariate; or b) The covariate can be integrated into the study as 
a second independent variable. As will be noted later, some sources endorse the use of either or 
both of the aforementioned strategies as preferable to employing an analysis of covariance. 

A less ideal situation for employing an analysis of covariance is an experiment involving 
a manipulated independent variable in which subjects are randomly assigned to groups, but in 
which scores on the covariate are not measured until after the experimental manipulation. The 
latter situation is more problematical with respect to using the analysis of covariance, since 
subjects’ scores on the covariate could have been influenced by the experimental treatments. 
This latter fact makes it more difficult to interpret the results of the analysis, since, in order to 
draw inferences with regard to cause and effect, the scores on the covariate should be 
independent of the treatments. 

An even more problematical use of the analysis of covariance is for a design in which sub- 
jects are not randomly assigned to groups (which is often referred to as a quasi-experimental 
design). The latter can involve the use of intact groups (such as two different classes at a school) 
who are exposed to a manipulated independent variable, or two groups that are formed on the 
basis of some pre-existing subject characteristic (i.e., an ex post facto study involving a non- 
manipulated independent variable such as gender, race, etc.). It should be noted that if a sub- 
stantial portion of between-groups variability can be explained on the basis of an extraneous 
variable (i.e., a potential covariate), it implies that there was probably some sort of systematic 
bias involved in forming the experimental groups. Because of the latter, it is reasonable to expect 
that if a researcher identifies one extraneous variable that has a substantial correlation with the 
dependent variable, there are probably other extraneous variables whose effects will not be 
controlled for, even if one evaluates the data with an analysis of covariance. In view of this, 
some sources (Keppel and Zedeck (1989) and Lord (1967, 1969)) argue that if subjects are not 
randomly assigned to groups, a researcher will not be able to adequately control for the potential 
effects of other pre-existing extraneous variables on the dependent variable. More specifically, 
they argue that the analysis of covariance will not be able to produce the necessary statistical 
control to allow one to unambiguously interpret the effect of the independent variable on the 
dependent variable. Thus, in designs in which subjects are not randomly assigned to groups, 
sources either state that the analysis of covariance should never be employed, or that if it is 
employed, the results should be interpreted with extreme caution. 

Many sources (e.g., Hinkle et al. (1998) and Keppel (1991)) note that in order for the 
analysis of covariance to provide effective statistical control the following two requirements must 
be met: a) A linear relationship must exist between the dependent variable and the covariate (since 
if they are not linearly related, the adjusted mean values computed for the k experimental con- 
ditions will be biased); and b) The covariate should be independent of (i.e., not be influenced by) 
the experimental treatments. The simplest way to evaluate this latter requirement is to conduct an 
analysis of variance using the covariate as the dependent variable. If the null hypothesis of 
equality between the covariate treatment means is rejected, it will indicate a lack of independence 
between the covariate and the independent variable. Maxwell and Delaney (1990; pp. 380-385) 
provide an excellent description of the conditions that can result in a lack of independence between 
the covariate and independent variable, and the impact it will have on interpreting the results of 
an analysis of covariance. When a lack of independence is present, the results will be most 
difficult to interpret when assignment of subjects to groups is nonrandom, and/or when the 
treatments affect the covariate in a situation where the covariate is measured after the 
administration of the treatments. Even in the ideal situation where the researcher randomly assigns 
subjects to the k experimental conditions and measures the covariate before introducing the 
experimental treatment, a researcher may detect a significant difference among the groups with respect 
to their means on the covariate. The latter situation could merely be representative of a Type I 
error; in other words, the mean differences on the covariate represent a fluke of chance 
variation. Under such circumstances, to ensure minimal ambiguity in interpreting the results of an 
analysis of covariance, the most prudent strategy would probably be to randomly reassign 
subjects until the k covariate treatment means are approximately the same. Maxwell and Delaney 
(1990) note that within the framework of an analysis of covariance, the accuracy of the adjusted 
means which are computed for the dependent variable decreases as the group means on the 
covariate deviate from the grand mean on the covariate. What the latter translates into is (assum- 
ing the covariate is, in fact, independent of the independent variable) that when the treatments 
differ with respect to covariate means, the adjusted values of the dependent variable will be less 
likely to result in a significant difference — in other words, the power of the analysis of co- 
variance to detect a significant effect between the treatment means on the dependent variable will 
be reduced. It is worth noting that some sources (e.g., Maxwell and Delaney (1990)) take the 
position that although problems in interpretation will result if the covariate is not independent of 
the experimental treatments, it does not necessarily mean that under such conditions an analysis 
of covariance cannot yield useful information. 

Kachigan (1986), among others, notes that one should never use an analysis of covariance 
to adjust for between-groups differences with respect to a covariate that are attributable to normal 
sampling error. Aside from the fact that such variability is a part of expected error variability 
within the framework of conducting an analysis of variance, a major reason for not employing 
an analysis of covariance (for which the test statistic is also an omnibus F value) is that the 
number of within-groups degrees of freedom required for the analysis will be one less than 
the $df_{WG}$ value required for an analysis of variance on the same set of data. In such a case, any reduction 
in error variance associated with the analysis of covariance may be offset by the fact that it will 
require a larger critical F value to reject the null hypothesis than will an analysis of variance on 
the original data. 

Example 21.5 will be employed to illustrate the use of the single-factor between-subjects 
analysis of covariance. Example 21.5 is identical to Example 21.1, except for the fact that the 
covariate of verbal intelligence is included in the analysis. It is because of the latter that the data 
are evaluated with the single-factor between-subjects analysis of covariance. 


Example 21.5 A psychologist conducts a study to determine whether or not noise can inhibit 
learning. Each of 15 subjects is randomly assigned to one of three groups. Each subject is given 
20 minutes to memorize a list of 10 nonsense syllables, which she is told she will be tested on the 
following day. The five subjects assigned to Group 1, the no noise condition, study the list of 
nonsense syllables while they are in a quiet room. The five subjects assigned to Group 2, the 
moderate noise condition, study the list of nonsense syllables while listening to classical music. 
The five subjects assigned to Group 3, the extreme noise condition, study the list of nonsense 
syllables while listening to rock music. The number of nonsense syllables correctly recalled by 
each of the 15 subjects follows: Group 1: 8, 10, 9, 10, 9; Group 2: 7, 8, 5, 8, 5; Group 3: 4, 8, 7, 5, 
7. From previous research it is known that a subject's ability to learn verbal material is highly 
correlated with that subject's verbal intelligence. As a result of the latter, a test of verbal intelligence 
(for which the maximum possible score is 20) is administered to each subject prior to introducing 
the experimental treatments. The verbal intelligence scores of the subjects follow: Group 1: 14, 
16, 15, 16, 15; Group 2: 15, 17, 15, 17, 15; Group 3: 14, 17, 16, 16, 16. Do the data indicate 
that noise influenced subjects' performance? 


The null and alternative hypotheses employed for the single-factor between-subjects 
analysis of covariance are identical to those employed for the single-factor between-subjects 
analysis of variance. The treatment means, however, that are contrasted in the analysis of co- 
variance are adjusted values. As noted earlier, the means are adjusted for the effect of the 
covariate. In view of the fact that the analysis of covariance evaluates adjusted mean values, the 
notation $\mu'_j$ will be employed to represent the adjusted value of the mean of the population the $j^{th}$ 
group represents. The null and alternative hypotheses for the analysis of covariance in reference 
to Example 21.5 are as follows. 


Null hypothesis $H_0$: $\mu'_1 = \mu'_2 = \mu'_3$ 


(The adjusted mean of the population Group 1 represents equals the adjusted mean of the 
population Group 2 represents equals the adjusted mean of the population Group 3 represents.) 


Alternative hypothesis $H_1$: Not $H_0$ 


(This indicates there is a difference between at least two of the k = 3 adjusted population 
means.) 


Computational procedures for the single-factor between-subjects analysis of covariance 
This section will describe a number of computational procedures that are used within the 
framework of an analysis of covariance. The reader should keep in mind that the analysis of 
covariance can be employed with experimental designs other than a between-subjects design, and 
that in conducting an analysis of covariance there can be more than one covariate. All of the 
procedures to be described in this section, however, are in reference to the single-factor 
between-subjects analysis of covariance involving a single covariate. 

The data analysis in this section will evaluate Example 21.5. In the latter example the 
number of nonsense syllables correctly recalled represents the dependent variable (which will be 
the Y variable), while the verbal intelligence test scores of subjects represent the covariate (which 
will be the X variable). Initially, an analysis of variance will be conducted on the covariate. The 
latter is generally done in order to demonstrate independence between the covariate and the 
experimental treatments. In the case of Example 21.5, one could argue that it is not necessary 
to conduct an analysis of variance on the covariate, since subjects are randomly assigned to 
groups, and the covariate is measured prior to the introduction of the experimental treatments. 
However, random assignment in and of itself does not ensure that the group means on the 
covariate will be equal. As noted earlier, Maxwell and Delaney (1990) state that the accuracy 
of the adjusted means which are computed for the dependent variable will decrease as the group 
means on the covariate deviate from the grand mean on the covariate. Most sources agree that 
if the assignment of subjects to groups is nonrandom, and/or the covariate is measured during or 
after the administration of treatments, an analysis of variance on the covariate should be 
conducted. Although a nonsignificant result for the latter analysis does not guarantee that the 
analysis of covariance (as well as the computed values for the adjusted treatment means) will 
be unbiased, it makes bias unlikely. 

After establishing that there are no significant differences between the covariate means, an 
analysis of covariance will be conducted. After doing the latter, adjusted treatment means on the 
dependent variable will be computed for the three groups. The remainder of the section will 
describe or discuss the following: a) Conducting multiple comparisons on the adjusted treatment 
means; b) Evaluating the homogeneity of regression assumption underlying the analysis of 
covariance; c) Computing measures of magnitude of treatment effect for the analysis of 
covariance; and d) Computing power for the analysis of covariance. 


Table 21.11 Data for Example 21.5 

Group 1 
$X_1$    $X_1^2$    $Y_1$    $Y_1^2$    $X_1Y_1$ 
14      196      8       64      112 
16      256      10      100     160 
15      225      9       81      135 
16      256      10      100     160 
15      225      9       81      135 
$\Sigma X_1 = 76$   $\Sigma X_1^2 = 1158$   $\Sigma Y_1 = 46$   $\Sigma Y_1^2 = 426$   $\Sigma X_1Y_1 = 702$ 
$\bar{X}_1 = 15.2$   $\bar{Y}_1 = 9.2$ 

$$SS_{X_1} = \Sigma X_1^2 - \frac{(\Sigma X_1)^2}{n_1} = 1158 - \frac{(76)^2}{5} = 2.8 \qquad SS_{Y_1} = \Sigma Y_1^2 - \frac{(\Sigma Y_1)^2}{n_1} = 426 - \frac{(46)^2}{5} = 2.8$$

$$SP_1 = SP_{XY_1} = \Sigma X_1Y_1 - \frac{(\Sigma X_1)(\Sigma Y_1)}{n_1} = 702 - \frac{(76)(46)}{5} = 2.8$$

Group 2 
$X_2$    $X_2^2$    $Y_2$    $Y_2^2$    $X_2Y_2$ 
15      225      7       49      105 
17      289      8       64      136 
15      225      5       25      75 
17      289      8       64      136 
15      225      5       25      75 
$\Sigma X_2 = 79$   $\Sigma X_2^2 = 1253$   $\Sigma Y_2 = 33$   $\Sigma Y_2^2 = 227$   $\Sigma X_2Y_2 = 527$ 
$\bar{X}_2 = 15.8$   $\bar{Y}_2 = 6.6$ 

$$SS_{X_2} = \Sigma X_2^2 - \frac{(\Sigma X_2)^2}{n_2} = 1253 - \frac{(79)^2}{5} = 4.8 \qquad SS_{Y_2} = \Sigma Y_2^2 - \frac{(\Sigma Y_2)^2}{n_2} = 227 - \frac{(33)^2}{5} = 9.2$$

$$SP_2 = SP_{XY_2} = \Sigma X_2Y_2 - \frac{(\Sigma X_2)(\Sigma Y_2)}{n_2} = 527 - \frac{(79)(33)}{5} = 5.6$$

Group 3 
$X_3$    $X_3^2$    $Y_3$    $Y_3^2$    $X_3Y_3$ 
14      196      4       16      56 
17      289      8       64      136 
16      256      7       49      112 
16      256      5       25      80 
16      256      7       49      112 
$\Sigma X_3 = 79$   $\Sigma X_3^2 = 1253$   $\Sigma Y_3 = 31$   $\Sigma Y_3^2 = 203$   $\Sigma X_3Y_3 = 496$ 
$\bar{X}_3 = 15.8$   $\bar{Y}_3 = 6.2$ 

$$SS_{X_3} = \Sigma X_3^2 - \frac{(\Sigma X_3)^2}{n_3} = 1253 - \frac{(79)^2}{5} = 4.8 \qquad SS_{Y_3} = \Sigma Y_3^2 - \frac{(\Sigma Y_3)^2}{n_3} = 203 - \frac{(31)^2}{5} = 10.8$$

$$SP_3 = SP_{XY_3} = \Sigma X_3Y_3 - \frac{(\Sigma X_3)(\Sigma Y_3)}{n_3} = 496 - \frac{(79)(31)}{5} = 6.2$$

All subjects 
$\Sigma X_T = 234$   $\Sigma X_T^2 = 3664$   $\Sigma Y_T = 110$   $\Sigma Y_T^2 = 856$   $\Sigma (XY)_T = 1725$ 

$$SS_{T(X)} = \Sigma X_T^2 - \frac{(\Sigma X_T)^2}{N} = 3664 - \frac{(234)^2}{15} = 13.6 \qquad SS_{T(Y)} = \Sigma Y_T^2 - \frac{(\Sigma Y_T)^2}{N} = 856 - \frac{(110)^2}{15} = 49.33$$

$$SP_{T(XY)} = \Sigma (XY)_T - \frac{(\Sigma X_T)(\Sigma Y_T)}{N} = 1725 - \frac{(234)(110)}{15} = 9$$

Table 21.11 summarizes the data for Example 21.5. As noted above, X represents the 
scores on the covariate, and Y represents the scores on the dependent variable. In the last column 
of Table 21.11 an XY score is computed for each subject. A subject's XY score is obtained by 
multiplying the subject's X score by the subject's Y score. The latter values are employed in 
computing the sum of products, which is represented with notation SP. (The sum of products is 
discussed in greater detail in Section VII of the Pearson product-moment correlation 
coefficient under the discussion of covariance.) The general equation for the sum of products is 
$SP_{XY} = \Sigma XY - [(\Sigma X)(\Sigma Y)]/n$. All terms in Table 21.11 with the subscript j (e.g., $\Sigma X_j$, $\Sigma Y_j$, 
etc., where j equals a specific group number) are based on the scores of the $n_j = 5$ subjects in 
the $j^{th}$ group, while all terms with the subscript T (e.g., $\Sigma X_T$, $\Sigma Y_T$, etc.) are based on the scores 
of all N = 15 subjects. 
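The sum of products can be checked with a few lines of code. The following is a minimal Python sketch, assuming the scores listed in Table 21.11; the function name `sum_of_products` and the variable names are illustrative choices and are not part of the original text.

```python
# Scores from Table 21.11 (X = covariate, Y = dependent variable).
groups = {
    1: ([14, 16, 15, 16, 15], [8, 10, 9, 10, 9]),
    2: ([15, 17, 15, 17, 15], [7, 8, 5, 8, 5]),
    3: ([14, 17, 16, 16, 16], [4, 8, 7, 5, 7]),
}


def sum_of_products(x, y):
    """Return SP = sum(XY) - (sum(X) * sum(Y)) / n for paired scores."""
    n = len(x)
    return sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n


# Sum of products within each group (n_j = 5 subjects per group).
sp_by_group = {j: sum_of_products(x, y) for j, (x, y) in groups.items()}

# Sum of products based on all N = 15 subjects.
all_x = [score for x, _ in groups.values() for score in x]
all_y = [score for _, y in groups.values() for score in y]
sp_total = sum_of_products(all_x, all_y)

for j, sp in sp_by_group.items():
    print(f"SP for Group {j}: {sp:.1f}")
print(f"SP for all subjects: {sp_total:.1f}")
```

The per-group values should agree with the SP values listed for each group in Table 21.11, and the last value with the total sum of products.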


Analysis of variance on the covariate The procedure for the analysis of variance on the 
covariate is identical to the procedure employed for the analysis of variance for Example 21.1, 
except for the fact that the covariate scores (X) are evaluated instead of the scores on the 
dependent variable (Y). The null and alternative hypotheses employed are identical to those 
employed for Example 21.1, except for the fact that they are stated in reference to the population 
means for the covariate. Thus, $H_0$: $\mu_{X_1} = \mu_{X_2} = \mu_{X_3}$ and $H_1$: Not $H_0$. The computations of the 
sums of squares for the analysis of variance on the covariate are summarized below. 


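$$SS_{T(X)} = \Sigma X_T^2 - \frac{(\Sigma X_T)^2}{N} = 3664 - \frac{(234)^2}{15} = 13.6$$

$$SS_{BG(X)} = \sum_{j=1}^{k}\left[\frac{(\Sigma X_j)^2}{n_j}\right] - \frac{(\Sigma X_T)^2}{N} = \frac{(76)^2}{5} + \frac{(79)^2}{5} + \frac{(79)^2}{5} - \frac{(234)^2}{15} = 1.2$$

$$SS_{WG(X)} = SS_{T(X)} - SS_{BG(X)} = 13.6 - 1.2 = 12.4$$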


Table 21.12 summarizes the analysis of variance on the covariate. The number of degrees 
of freedom employed are the same as those employed for Example 21.1, since the identical 
number of groups and subjects are employed in the analysis. Thus, $df_{BG} = 2$ and $df_{WG} = 12$. 
The tabled critical values employed in Table A10 are $F_{.05} = 3.89$ and $F_{.01} = 6.93$. Since the 
computed value F = .58 is less than $F_{.05} = 3.89$, the null hypothesis cannot be rejected at the 
.05 level. Thus, we can conclude there is no significant difference between the groups with respect to their 
mean scores on the covariate (which are $\bar{X}_1 = 15.2$, $\bar{X}_2 = 15.8$, $\bar{X}_3 = 15.8$). As noted earlier, 
this is generally interpreted as indicating that the covariate is independent of the experimental 
treatments. 


Table 21.12 Summary Table of Analysis of Variance 
on Covariate for Example 21.5 


Source of variation     SS       df     MS      F 
Between-groups          1.20     2      .60     .58 
Within-groups           12.40    12     1.03 

Total                   13.60    14 
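The analysis of variance on the covariate summarized in Table 21.12 can be reproduced with a short script. This is a minimal Python sketch, assuming the covariate scores of Example 21.5; the function name `one_way_anova` is a hypothetical helper introduced only for illustration.

```python
# Covariate (verbal intelligence) scores for the three groups of Example 21.5.
x_groups = [
    [14, 16, 15, 16, 15],  # Group 1
    [15, 17, 15, 17, 15],  # Group 2
    [14, 17, 16, 16, 16],  # Group 3
]


def one_way_anova(groups):
    """Return (SS_BG, SS_WG, SS_T, df_BG, df_WG, F) for a between-subjects design."""
    scores = [v for g in groups for v in g]
    n_total, k = len(scores), len(groups)
    correction = sum(scores) ** 2 / n_total           # (sum X_T)^2 / N
    ss_total = sum(v * v for v in scores) - correction
    ss_bg = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ss_wg = ss_total - ss_bg
    df_bg, df_wg = k - 1, n_total - k
    f_ratio = (ss_bg / df_bg) / (ss_wg / df_wg)
    return ss_bg, ss_wg, ss_total, df_bg, df_wg, f_ratio


ss_bg, ss_wg, ss_total, df_bg, df_wg, f_ratio = one_way_anova(x_groups)
print(f"SS_BG = {ss_bg:.2f}, SS_WG = {ss_wg:.2f}, SS_T = {ss_total:.2f}")
print(f"df = ({df_bg}, {df_wg}), F = {f_ratio:.2f}")
```

The resulting F value is then compared with the tabled critical value for the obtained degrees of freedom, as described above.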
The analysis of covariance In conducting the analysis of covariance we will employ the sum 
of squares values that were previously computed in the analysis of variance on the dependent 
variable and in the analysis of variance on the covariate. The values that will be employed are 
listed below. Note that the notation for each of the sum of squares values listed indicates whether 
a sum of squares value is for the X variable (the covariate) or the Y variable (the dependent 
variable): a) $SS_{BG(X)} = 1.2$, which represents the between-groups sum of squares for the 
covariate; b) $SS_{WG(X)} = 12.4$, which represents the within-groups sum of squares for the covariate; 
c) $SS_{T(X)} = 13.6$, which represents the total sum of squares for the covariate; d) $SS_{BG(Y)} = 26.53$, 
which represents the between-groups sum of squares for the dependent variable; e) $SS_{WG(Y)} 
= 22.8$, which represents the within-groups sum of squares for the dependent variable; and 
f) $SS_{T(Y)} = 49.33$, which represents the total sum of squares for the dependent variable. In 
addition to employing the aforementioned values, we must compute the values noted below. 
a) A between-groups sum of products, represented by the notation $SP_{BG(XY)}$, is 
computed with Equation 21.52. The notation $\sum_{j=1}^{k}[(\Sigma X_j)(\Sigma Y_j)/n_j]$ on the right side of Equation 
21.52 indicates that for each group the sum of the scores on the X variable is multiplied by the 
sum of the scores on the Y variable, and the product is divided by the number of subjects in the 
group. Upon doing the latter for all k groups, the k values that have been obtained are summed. 
The value $[(\Sigma X_T)(\Sigma Y_T)]/N$ is subtracted from the resulting sum. The value $[(\Sigma X_T)(\Sigma Y_T)]/N$ 
is obtained by multiplying the total sum of the scores on the X variable ($\Sigma X_T$) by the total sum 
of the scores on the Y variable ($\Sigma Y_T$), and dividing the product by the total number of subjects 
(N). Note that unlike a sum of squares value, a sum of products value can be a negative number. 


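(Equation 21.52) 
$$SP_{BG(XY)} = \sum_{j=1}^{k}\left[\frac{(\Sigma X_j)(\Sigma Y_j)}{n_j}\right] - \frac{(\Sigma X_T)(\Sigma Y_T)}{N}$$

$$SP_{BG(XY)} = \left[\frac{(76)(46)}{5} + \frac{(79)(33)}{5} + \frac{(79)(31)}{5}\right] - \frac{(234)(110)}{15} = 1710.4 - 1716 = -5.6$$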


b) A within-groups sum of products, represented by the notation $SP_{WG(XY)}$, is 
computed with Equation 21.53. As noted earlier, the notation $\sum_{j=1}^{k}[(\Sigma X_j)(\Sigma Y_j)/n_j]$ on the right side 
of Equation 21.53 indicates that for each group the sum of the scores on the X variable is 
multiplied by the sum of the scores on the Y variable, and the product is divided by the number 
of subjects in the group. Upon doing the latter for all k groups, the k values that have been 
obtained are summed. The resulting value is subtracted from the term $\Sigma (XY)_T$, which is the sum 
of the XY scores of all N = 15 subjects. 


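(Equation 21.53) 
$$SP_{WG(XY)} = \Sigma (XY)_T - \sum_{j=1}^{k}\left[\frac{(\Sigma X_j)(\Sigma Y_j)}{n_j}\right]$$

$$SP_{WG(XY)} = 1725 - \left[\frac{(76)(46)}{5} + \frac{(79)(33)}{5} + \frac{(79)(31)}{5}\right] = 1725 - 1710.4 = 14.6$$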


c) A total sum of products, represented by the notation $SP_{T(XY)}$, is computed with 
Equation 21.54. In computing the total sum of products, the value $[(\Sigma X_T)(\Sigma Y_T)]/N$ (which, as noted 
earlier, is obtained by multiplying the total sum of the scores on the X variable ($\Sigma X_T$) by the 
total sum of the scores on the Y variable ($\Sigma Y_T$), and dividing the product by the total number of 
subjects (N)) is subtracted from the term $\Sigma (XY)_T$, which, as noted earlier, is the sum of the XY 
scores of all N = 15 subjects. 

(Equation 21.54) 
$$SP_{T(XY)} = \Sigma (XY)_T - \frac{(\Sigma X_T)(\Sigma Y_T)}{N}$$

$$SP_{T(XY)} = 1725 - \frac{(234)(110)}{15} = 1725 - 1716 = 9$$
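The three sums of products can be verified together, along with the fact that the between-groups and within-groups values add up to the total. The following is a minimal Python sketch, assuming the data of Table 21.11; all variable names are illustrative.

```python
# (X, Y) scores for each group in Table 21.11.
groups = [
    ([14, 16, 15, 16, 15], [8, 10, 9, 10, 9]),   # Group 1
    ([15, 17, 15, 17, 15], [7, 8, 5, 8, 5]),     # Group 2
    ([14, 17, 16, 16, 16], [4, 8, 7, 5, 7]),     # Group 3
]

n_total = sum(len(x) for x, _ in groups)

# Sum of the XY scores of all N subjects.
sum_xy_total = sum(a * b for x, y in groups for a, b in zip(x, y))

# Sum over groups of (sum X_j)(sum Y_j) / n_j.
group_term = sum(sum(x) * sum(y) / len(x) for x, y in groups)

# (sum X_T)(sum Y_T) / N.
correction = sum(sum(x) for x, _ in groups) * sum(sum(y) for _, y in groups) / n_total

sp_bg = group_term - correction        # between-groups sum of products
sp_wg = sum_xy_total - group_term      # within-groups sum of products
sp_total = sum_xy_total - correction   # total sum of products

print(f"SP_BG = {sp_bg:.1f}, SP_WG = {sp_wg:.1f}, SP_T = {sp_total:.1f}")
```

Note that the between-groups value is negative in this example, which is permissible for a sum of products.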


As is the case with the analysis of variance, sum of squares values are computed for the 
analysis of covariance. However, the sums of squares computed for the analysis of covariance 
are adjusted for the effects of the covariate. The analysis of covariance requires that the 
following three adjusted sum of squares values be computed: a) The adjusted total sum of 
squares, which will be represented by the notation $SS_{T(adj)}$; b) The adjusted within-groups sum 
of squares, which will be represented by the notation $SS_{WG(adj)}$; and c) The adjusted 
between-groups sum of squares, which will be represented by the notation $SS_{BG(adj)}$. 


The adjusted total sum of squares is computed with Equation 21.55. 


(Equation 21.55) 
$$SS_{T(adj)} = SS_{T(Y)} - \frac{(SP_{T(XY)})^2}{SS_{T(X)}}$$

$$SS_{T(adj)} = 49.33 - \frac{(9)^2}{13.6} = 43.37$$


The adjusted within-groups sum of squares is computed with Equation 21.56. 


(Equation 21.56) 
$$SS_{WG(adj)} = SS_{WG(Y)} - \frac{(SP_{WG(XY)})^2}{SS_{WG(X)}}$$

$$SS_{WG(adj)} = 22.8 - \frac{(14.6)^2}{12.4} = 5.61$$


The adjusted between-groups sum of squares is computed with Equation 21.57. 


(Equation 21.57) 
$$SS_{BG(adj)} = SS_{T(adj)} - SS_{WG(adj)}$$

$$SS_{BG(adj)} = 43.37 - 5.61 = 37.76$$

Table 21.13 summarizes the analysis of covariance. 

Table 21.13 Summary Table of Analysis of Covariance 
for Example 21.5 

Source of variation     SS       $df_{adj}$   $MS_{adj}$   F 

Between-groups          37.76    2        18.88    37.02 

Within-groups           5.61     11       .51 

Total                   43.37    13 

Note that variability for the analysis of covariance is partitioned into between-groups 
variability and within-groups variability, the same two components that variability is partitioned 
into in the analysis of variance. As is the case for an analysis of variance, a mean square is 
computed for each source of variability by dividing the sum of squares by its respective degrees 
of freedom. Equations 21.58-21.60 summarize the computation of the degrees of freedom 
values. Note that in the analysis of covariance, the within-groups (error) degrees of freedom are 
$df_{WG(adj)} = N - k - 1$, as opposed to the value $df_{WG} = N - k$ computed for the analysis of variance. The loss of one 
degree of freedom in the analysis of covariance reflects the use of the covariate as a statistical 
control. Equations 21.61 and 21.62 summarize the computation of the mean square values, and 
Equation 21.63 computes the F ratio for the analysis of covariance. 


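$$df_{BG(adj)} = k - 1 = 3 - 1 = 2$$ (Equation 21.58) 

$$df_{WG(adj)} = N - k - 1 = 15 - 3 - 1 = 11$$ (Equation 21.59) 

$$df_{T(adj)} = N - 2 = 15 - 2 = 13$$ (Equation 21.60) 

$$MS_{BG(adj)} = \frac{SS_{BG(adj)}}{df_{BG(adj)}} = \frac{37.76}{2} = 18.88$$ (Equation 21.61) 

$$MS_{WG(adj)} = \frac{SS_{WG(adj)}}{df_{WG(adj)}} = \frac{5.61}{11} = .51$$ (Equation 21.62) 

$$F = \frac{MS_{BG(adj)}}{MS_{WG(adj)}} = \frac{18.88}{.51} = 37.02$$ (Equation 21.63) 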


The obtained value F = 37.02 is evaluated with Table A10, employing as the numerator and 
denominator degrees of freedom $df_{num} = df_{BG(adj)} = 2$ and $df_{den} = df_{WG(adj)} = 11$. For $df_{num} = 2$ 
and $df_{den} = 11$, the tabled $F_{.05}$ and $F_{.01}$ values are $F_{.05} = 3.98$ and $F_{.01} = 7.21$. In order 
to reject the null hypothesis, the obtained F value must be equal to or greater than the tabled 
critical value at the prespecified level of significance. Since F = 37.02 is greater than $F_{.05} = 3.98$ 
and $F_{.01} = 7.21$, the alternative hypothesis is supported at both the .05 and .01 levels. 

Note that the value F = 37.02 computed for the analysis of covariance is substantially larger 
than the value F = 6.98 computed for the analysis of variance in Example 21.1. The larger F 
value for the analysis of covariance can be attributed to the following two factors: a) The 
reduction in error variability in the analysis of covariance. The latter is reflected by the fact that 
the within-groups mean square value for the analysis of covariance $MS_{WG(adj)} = .51$ is 
substantially less than the analogous value $MS_{WG} = 1.90$ computed for the analysis of variance; and 
b) The larger between-groups mean square computed for the analysis of covariance. With 
respect to the latter, in the case of the analysis of covariance $MS_{BG(adj)} = 18.88$, whereas for the 
analysis of variance $MS_{BG} = 13.27$. 

Thus, when we contrast the result of the analysis of covariance for Example 21.5 with the 
result of the analysis of variance for Example 21.1, we can state that by virtue of controlling for 
variation attributable to the covariate we are able to conduct a more sensitive analysis with 
respect to the impact of the treatments on the dependent variable. As is the case for the analysis 
of variance, it can be concluded that there is a significant difference between at least two of the 
three groups exposed to different levels of noise. This result can be summarized as follows: 
F(2,11) = 37.02, p < .01. 
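The full sequence of computations (Equations 21.55 through 21.63) can be sketched in code. The following is a minimal Python sketch, assuming the data of Example 21.5; the helper `cross_partition` is a hypothetical function written for this illustration. Because the script carries unrounded intermediate values, its F ratio may differ from the hand-computed value in the second decimal place.

```python
# Covariate (X) and dependent variable (Y) scores for Example 21.5.
x_groups = [[14, 16, 15, 16, 15], [15, 17, 15, 17, 15], [14, 17, 16, 16, 16]]
y_groups = [[8, 10, 9, 10, 9], [7, 8, 5, 8, 5], [4, 8, 7, 5, 7]]


def cross_partition(a_groups, b_groups):
    """Return (between, within, total) sums of products of a and b.

    Passing the same variable twice yields the sums of squares.
    """
    n_total = sum(len(g) for g in a_groups)
    raw = sum(a * b for ga, gb in zip(a_groups, b_groups) for a, b in zip(ga, gb))
    group_term = sum(sum(ga) * sum(gb) / len(ga) for ga, gb in zip(a_groups, b_groups))
    correction = sum(map(sum, a_groups)) * sum(map(sum, b_groups)) / n_total
    return group_term - correction, raw - group_term, raw - correction


_, ss_wg_x, ss_t_x = cross_partition(x_groups, x_groups)
_, ss_wg_y, ss_t_y = cross_partition(y_groups, y_groups)
_, sp_wg, sp_t = cross_partition(x_groups, y_groups)

# Adjusted sums of squares (Equations 21.55-21.57).
ss_t_adj = ss_t_y - sp_t ** 2 / ss_t_x
ss_wg_adj = ss_wg_y - sp_wg ** 2 / ss_wg_x
ss_bg_adj = ss_t_adj - ss_wg_adj

# Degrees of freedom (Equations 21.58-21.59); one df is lost to the covariate.
k = len(x_groups)
n_total = sum(len(g) for g in x_groups)
df_bg_adj = k - 1
df_wg_adj = n_total - k - 1

# Mean squares and F ratio (Equations 21.61-21.63).
ms_bg_adj = ss_bg_adj / df_bg_adj
ms_wg_adj = ss_wg_adj / df_wg_adj
f_ratio = ms_bg_adj / ms_wg_adj

print(f"SS_T(adj) = {ss_t_adj:.2f}, SS_WG(adj) = {ss_wg_adj:.2f}, SS_BG(adj) = {ss_bg_adj:.2f}")
print(f"F({df_bg_adj},{df_wg_adj}) = {f_ratio:.2f}")
```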
Some sources employ the format in Table 21.14 to summarize the results of an analysis of 
covariance. 


Table 21.14 Alternative Summary Table of Analysis of 
Covariance for Example 21.5 


Source of variation     SS       $df_{adj}$   $MS_{adj}$   F 
Covariate               5.96     1        5.96     11.65 
Between-groups          37.76    2        18.88    37.02 
Within-groups           5.61     11       .51 

Total                   49.33    14 


Note that there are two differences between Tables 21.13 and 21.14: a) Table 21.14 contains 
a row for variability attributed to the covariate; and b) The total sum of squares and total degrees 
of freedom in the last row of Table 21.14 are different than the values listed in Table 21.13. In 
point of fact, the value for total variability in Table 21.14 ($SS_T = 49.33$) is equal to the total 
variability computed for Table 21.2 (the table for the original analysis of variance, which did not 
include the covariate in the study). In the same respect, the value for total degrees of freedom 
in Table 21.14 ($df_T = 14$) is equal to the total degrees of freedom computed for Table 21.2. In 
Table 21.14 one degree of freedom is employed for the covariate, and consequently the total 
degrees of freedom is equal to $df_T = N - 1$ (which is the same value employed for the total 
degrees of freedom for the analysis of variance in Table 21.2). The value $SS_{cov} = 5.96$ is 
computed with Equation 21.64, and Equations 21.65 and 21.66 are employed to compute the 
values $MS_{cov} = 5.96$ and F = 11.65 for the covariate. 





SS. SS ~ SSpqgy = 49.33 - 43.37 = 5.96 (Equation 21.64) 
o Seow 5.96 _ . 
MS... = —— = — = 5.96 (Equation 21.65) 
df, 1 
MS 
F=- o 11900... 4 66 (Equation 21.66) 
MS yeaa) 5! 
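As a check on Equation 21.64, the covariate sum of squares can also be obtained directly from the total sum of products. The short derivation below is a sketch that follows from substituting Equation 21.55 into Equation 21.64; the symbol $SS_{cov}$ is used for the covariate sum of squares.

```latex
\begin{aligned}
SS_{cov} &= SS_{T(Y)} - SS_{T(adj)} \\
         &= SS_{T(Y)} - \left[ SS_{T(Y)} - \frac{(SP_{T(XY)})^2}{SS_{T(X)}} \right] \\
         &= \frac{(SP_{T(XY)})^2}{SS_{T(X)}} = \frac{(9)^2}{13.6} \approx 5.96
\end{aligned}
```

In other words, the variability attributed to the covariate is the portion of the total variability on the dependent variable that is removed when the total sum of squares is adjusted.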


Note that the F ratio computed for the covariate with Equation 21.66 is not the same F ratio 
computed for the covariate in Table 21.12. Whereas the latter F ratio does not take the dependent 
variable into account, the value F = 11.65 computed with Equation 21.66 assumes the presence 
of both a dependent variable and a covariate in the data. Hinkle et al. (1998) note that the value 
F = 11.65 computed for the covariate can be employed to evaluate the null hypothesis 
$H_0$: $\rho = 0$, which states that in the underlying population the sample represents, the correlation 
between the scores of subjects on the covariate and the dependent variable equals 0 (the notation 
$\rho$ is the lower case Greek letter rho, which is employed to represent the population correlation). 
Or to put it another way, the null hypothesis is stating that there is no linear relationship between 
the covariate and the dependent variable. The alternative hypothesis (which is nondirectional) 
that is evaluated is $H_1$: $\rho \neq 0$. The latter alternative hypothesis states that in the underlying 
population the sample represents, the correlation between the scores of subjects on the covariate 
and the dependent variable does not equal 0. Or to put it another way, that there is a linear 
relationship between the covariate and the dependent variable. 

Since an assumption of the analysis of covariance is that the dependent variable and 
covariate are linearly related, we want to reject the null hypothesis. If the F ratio 
computed with Equation 21.66 is statistically significant, the null hypothesis $H_0$: $\rho = 0$ can 
be rejected. The numerator and denominator degrees of freedom employed in the analysis 
of the covariate F ratio are $df_{num} = df_{cov} = 1$ and $df_{den} = df_{WG(adj)} = 11$. In Table A10, for 
$df_{num} = 1$ and $df_{den} = 11$, the tabled $F_{.05}$ and $F_{.01}$ values are $F_{.05} = 4.84$ and $F_{.01} = 9.65$. 
Since F = 11.65 is greater than $F_{.05} = 4.84$ and $F_{.01} = 9.65$, the alternative hypothesis is 
supported at both the .05 and .01 levels. In other words, we can conclude there is a significant 
linear relationship between the dependent variable and the covariate. 

It is possible to compute the following three correlation coefficients from the data in Table 
21.11: a) The overall correlation coefficient based on the total number of scores in the three 
groups. The latter value, represented by the notation $r_T$, is computed in Endnote 69 to be 
$r_T = .347$; b) The within-groups correlation coefficient (represented by the notation $r_{WG}$), 
which is a weighted average correlation coefficient between the covariate and the dependent 
variable within each of the k = 3 groups. The within-groups correlation coefficient is computed 
with Equation 21.67 to be $r_{WG} = .868$. The greater the absolute value of $r_{WG}$, the greater the 
precision of the analysis of covariance; and c) The between-groups correlation coefficient 
(represented by the notation $r_{BG}$), which is a correlation coefficient between the k = 3 treatment 
means on the covariate and the k = 3 treatment means on the dependent variable. The 
between-groups correlation coefficient is computed (employing Equation 28.1) to be $r_{BG} = -.992$. In 
computing the latter value, the three pairs of X and Y scores that are substituted in Equation 28.1, 
with n = 3, are ($\bar{X}_1 = 15.2$, $\bar{Y}_1 = 9.2$), ($\bar{X}_2 = 15.8$, $\bar{Y}_2 = 6.6$), and ($\bar{X}_3 = 15.8$, $\bar{Y}_3 = 6.2$). 


(Equation 21.67) 
$$r_{WG} = \frac{SP_{WG(XY)}}{\sqrt{SS_{WG(X)}\,SS_{WG(Y)}}} = \frac{14.6}{\sqrt{(12.4)(22.8)}} = .868$$


Kirk (1995, p. 719) notes that if the value of $r_{BG}$ is larger than the value of $r_{WG}$, any 
reduction in error variability has a high likelihood of being offset by a reduction in 
between-groups variability. As a result of the latter, the F value computed for the analysis of covariance 
may actually be lower than the F value computed for the analysis of variance on the dependent 
variable. However, if $r_{BG}$ is negative and $r_{WG}$ is positive (as is the case in our example), the F 
value computed for the analysis of covariance will be greater than the F value computed for the 
analysis of variance on the dependent variable. Winer et al. (1991) note that the higher the value 
of $r_{WG}$, the lower the value of the error term that will be computed for the analysis of covariance 
relative to the value of the error term that will be computed if an analysis of variance is conducted 
on the dependent variable. Thus, the greater the absolute value of $r_{WG}$, the greater the precision 
of the analysis of covariance. 
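The three correlation coefficients can be computed from the raw scores. The following is a minimal Python sketch, assuming the data of Table 21.11; the helper `ss_sp` and the variable names are illustrative. The within-groups coefficient is obtained by pooling the within-group sums of squares and products, as in Equation 21.67, and the between-groups coefficient is obtained by correlating the three pairs of group means.

```python
from math import sqrt

x_groups = [[14, 16, 15, 16, 15], [15, 17, 15, 17, 15], [14, 17, 16, 16, 16]]
y_groups = [[8, 10, 9, 10, 9], [7, 8, 5, 8, 5], [4, 8, 7, 5, 7]]


def ss_sp(x, y):
    """Return (SS_X, SS_Y, SP_XY) for one set of paired scores."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
    sp = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return ss_x, ss_y, sp


# a) Overall correlation based on all N = 15 pairs of scores.
all_x = [v for g in x_groups for v in g]
all_y = [v for g in y_groups for v in g]
ss_x, ss_y, sp = ss_sp(all_x, all_y)
r_total = sp / sqrt(ss_x * ss_y)

# b) Within-groups correlation: pool SS and SP over the k groups.
parts = [ss_sp(x, y) for x, y in zip(x_groups, y_groups)]
ss_wg_x = sum(p[0] for p in parts)
ss_wg_y = sum(p[1] for p in parts)
sp_wg = sum(p[2] for p in parts)
r_wg = sp_wg / sqrt(ss_wg_x * ss_wg_y)

# c) Between-groups correlation: correlate the k pairs of group means.
x_means = [sum(g) / len(g) for g in x_groups]
y_means = [sum(g) / len(g) for g in y_groups]
ss_x, ss_y, sp = ss_sp(x_means, y_means)
r_bg = sp / sqrt(ss_x * ss_y)

print(f"r_T = {r_total:.3f}, r_WG = {r_wg:.3f}, r_BG = {r_bg:.3f}")
```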


Computing the adjusted group/treatment means The analysis of covariance is an analysis 
of any variability on the dependent variable that is not accounted for by the covariate. If the 
result of the analysis of covariance is significant, it indicates that two or more treatment means 
differ significantly from one another. However, prior to comparing treatment means it is 
necessary that the latter values be adjusted for the effects of the covariate. In computing the 
adjusted treatment means, we are determining what the scores on the dependent variable would 
be if the groups did not differ on the covariate. Maxwell and Delaney (1990, p. 378) note that 
the adjusted treatment means on the dependent variable can be viewed as estimates of the values 
that would be obtained if the mean of the covariate for each of the groups was equal to the mean of 
all N subjects on the covariate. 

Equation 21.68 is the general equation for computing an adjusted treatment/group mean. 

$$\bar{Y}'_j = \bar{Y}_j - b_{WG}(\bar{X}_j - \bar{X}_T)$$ (Equation 21.68) 


In Equation 21.68, the adjusted mean on the dependent variable for the $j^{th}$ group is 
represented by the notation $\bar{Y}'_j$. The notation $\bar{Y}_j$ is the unadjusted mean for the $j^{th}$ group on the 
dependent variable (i.e., in the case of Example 21.5, the unadjusted means are the $\bar{Y}_j$ values in 
Table 21.11 which are the same as the group means computed for Example 21.1). Thus, 
$\bar{Y}_1 = 9.2$, $\bar{Y}_2 = 6.6$, and $\bar{Y}_3 = 6.2$. The value $\bar{X}_j$ represents the mean of the $j^{th}$ group on the 
covariate. Thus, as noted in Table 21.11, $\bar{X}_1 = 15.2$, $\bar{X}_2 = 15.8$, and $\bar{X}_3 = 15.8$. The value 
$\bar{X}_T$ is the mean of the N = 15 subjects on the covariate. Thus, $\bar{X}_T = (\Sigma X_1 + \Sigma X_2 + \Sigma X_3)/15 
= (76 + 79 + 79)/15 = 15.6$. The value $b_{WG}$, which is referred to as the within-groups 
regression coefficient, is computed with either Equation 21.69 or Equation 21.70. For Example 
21.5, the value $b_{WG} = 1.18$ is computed. 


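$$b_{WG} = \frac{SP_{WG(XY)}}{SS_{WG(X)}} = \frac{14.6}{12.4} = 1.18$$ (Equation 21.69) 

$$b_{WG} = r_{WG}\sqrt{\frac{MS_{WG(Y)}}{MS_{WG(X)}}} = .868\sqrt{\frac{1.90}{1.03}} = 1.18$$ (Equation 21.70) 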


The adjusted group means are computed for the k = 3 groups below. 


$$\bar{Y}'_1 = 9.2 - 1.18(15.2 - 15.6) = 9.672$$

$$\bar{Y}'_2 = 6.6 - 1.18(15.8 - 15.6) = 6.364$$

$$\bar{Y}'_3 = 6.2 - 1.18(15.8 - 15.6) = 5.964$$
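The adjustment of the group means with Equation 21.68 can be sketched as follows. This is a minimal Python sketch, assuming the data of Example 21.5 and the within-groups values $SP_{WG(XY)} = 14.6$ and $SS_{WG(X)} = 12.4$ reported above; the regression coefficient is rounded to two decimal places only to mirror the hand computation, and the variable names are illustrative.

```python
x_groups = [[14, 16, 15, 16, 15], [15, 17, 15, 17, 15], [14, 17, 16, 16, 16]]
y_groups = [[8, 10, 9, 10, 9], [7, 8, 5, 8, 5], [4, 8, 7, 5, 7]]

# Within-groups regression coefficient (Equation 21.69), rounded as in the text.
sp_wg_xy = 14.6
ss_wg_x = 12.4
b_wg = round(sp_wg_xy / ss_wg_x, 2)

# Group means on the covariate and dependent variable, and the grand covariate mean.
x_means = [sum(g) / len(g) for g in x_groups]
y_means = [sum(g) / len(g) for g in y_groups]
x_grand_mean = sum(sum(g) for g in x_groups) / sum(len(g) for g in x_groups)

# Equation 21.68: adjusted mean = unadjusted mean - b_WG * (group X mean - grand X mean).
adjusted_means = [
    y_bar - b_wg * (x_bar - x_grand_mean)
    for x_bar, y_bar in zip(x_means, y_means)
]

for j, (raw, adj) in enumerate(zip(y_means, adjusted_means), start=1):
    print(f"Group {j}: unadjusted = {raw:.1f}, adjusted = {adj:.3f}")
```

Groups whose covariate mean falls below the grand mean have their mean on the dependent variable adjusted upward, and groups above the grand mean are adjusted downward.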


Conducting comparisons among the adjusted group/treatment means The same types of 
comparisons for contrasting group means on the dependent variable that are described in Section 
VI can be employed in contrasting the adjusted group means. Visual inspection of the adjusted 
and unadjusted group means reveals that in the case of Groups 1 and 2 the adjusted means for 
the latter groups are further removed from one another than the unadjusted means. Employing 
the same protocol used in Section VI, the planned simple comparison on the adjusted means 
of Groups 1 and 2 is summarized in Table 21.15. The test statistic is then computed employing 
the appropriate equations presented in Section VI. Note that in the latter equations, the notation 
appropriate for comparing the adjusted mean values is substituted for the notation employed in 
Section VI. 


Table 21.15 Planned Simple Comparison: Group 1 Versus Group 2 


Group     $\bar{Y}'_j$     Coefficient     Product                      Squared coefficient 
j                  ($c_j$)           ($c_j$)($\bar{Y}'_j$)                    ($c_j^2$) 

1         9.672    +1              (+1)(9.672) = +9.672         1 
2         6.364    -1              (-1)(6.364) = -6.364         1 
3         5.964    0               (0)(5.964) = 0               0 
                   $\Sigma c_j = 0$        $\Sigma (c_j)(\bar{Y}'_j) = 3.308$        $\Sigma c_j^2 = 2$ 

Equation 21.17 is employed to compute the value $SS_{comp(adj)} = 27.36$. 

$$SS_{comp(adj)} = \frac{n(\Sigma c_j \bar{Y}'_j)^2}{\Sigma c_j^2} = \frac{5(3.308)^2}{2} = 27.36$$

Equation 21.18 is employed to compute the value $MS_{comp(adj)} = 27.36$. 

$$MS_{comp(adj)} = \frac{SS_{comp(adj)}}{df_{comp}} = \frac{27.36}{1} = 27.36$$


The test statistic $F_{comp(adj)}$ is computed with Equation 21.72. Keppel (1990) notes that instead 
of employing $MS_{WG(adj)}$ as the denominator in Equation 21.72, the value computed with Equation 
21.71 for $MS'_{WG(adj)}$ should be employed. 

(Equation 21.71) 
$$MS'_{WG(adj)} = MS_{WG(adj)}\left[1 + \frac{SS_{BG(X)}/(k - 1)}{SS_{WG(X)}}\right] = .51\left[1 + \frac{1.2/2}{12.4}\right] = .535$$

$$F_{comp(adj)} = \frac{MS_{comp(adj)}}{MS'_{WG(adj)}} = \frac{27.36}{.535} = 51.14$$ (Equation 21.72) 


Substituting $MS_{error} = .535$ in Equation 21.72, the value $F_{comp(adj)} = 51.14$ is computed.
The numerator and denominator degrees of freedom employed for evaluating the computed F
value for the comparison are $df_{num} = 1$ (since it is a single degree of freedom comparison), and
$df_{den} = df_{WG(adj)} = 11$. In Table A10, for $df_{num} = 1$ and $df_{den} = 11$, the tabled $F_{.05}$ and $F_{.01}$
values are $F_{.05} = 4.84$ and $F_{.01} = 9.65$. Since $F_{comp(adj)} = 51.14$ is greater than $F_{.05} = 4.84$
and $F_{.01} = 9.65$, the alternative hypothesis is supported at both the .05 and .01 levels. In other
words, we can conclude that there is a significant difference between the adjusted means of
Groups 1 and 2. Note how much larger the value $F_{comp(adj)} = 51.14$ is than the value
$F_{comp} = 8.89$ computed for the unadjusted Group 1 versus Group 2 comparison in Section VI.
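
The arithmetic for the planned comparison on the adjusted means can be checked with a short Python sketch. The variable names are illustrative assumptions and are not part of the text; the adjusted means, the group size n = 5, and the error term .535 are taken from the computations above. The mean square is rounded to two decimal places before the F ratio is formed, on the assumption that this mirrors the rounding used in the text.

```python
# Planned simple comparison on the adjusted means (Group 1 versus Group 2).
# All names below are illustrative; the numbers come from the worked example.
adjusted_means = [9.672, 6.364, 5.964]   # adjusted group means
coefficients = [1, -1, 0]                # comparison coefficients c_j
n = 5                                    # subjects per group
ms_error = 0.535                         # error term reported for Equation 21.71

# Sum of coefficient-by-mean products and sum of squared coefficients
contrast = sum(c * m for c, m in zip(coefficients, adjusted_means))
sum_sq_coeff = sum(c ** 2 for c in coefficients)

ss_comp_adj = n * contrast ** 2 / sum_sq_coeff    # Equation 21.17
ms_comp_adj = ss_comp_adj / 1                     # Equation 21.18, df_comp = 1
# Assumption: round the mean square to two decimals first, as the text appears to do
f_comp_adj = round(ms_comp_adj, 2) / ms_error     # Equation 21.72

print(round(contrast, 3), round(ss_comp_adj, 2), round(f_comp_adj, 2))
```

If the unrounded mean square is used instead, the F ratio differs from the reported value only in the second decimal place.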

Equations 21.30 and 21.31 (which are employed for Tukey's HSD test) will be used to
illustrate an unplanned comparison on the adjusted means of Groups 1 and 2. When Equation
21.30 is employed, the value $q = 10.12$ is computed.

$$q = \frac{\bar{Y}'_1 - \bar{Y}'_2}{\sqrt{\dfrac{MS_{error}}{n}}} = \frac{9.672 - 6.364}{\sqrt{\dfrac{.535}{5}}} = 10.12$$


The obtained value $q = 10.12$ is evaluated with Table A13. In the latter table, for the
prespecified probability level, we locate the q value that is in the cell which is the intersection
of the column for k = 3 means (which represents the total number of groups upon which the
omnibus F value for the analysis of covariance is based), and the row for $df_{error} = 11$ (which
represents the value $df_{WG(adj)} = 11$ computed for the analysis of covariance). The tabled critical $q_{.05}$
and $q_{.01}$ values for k = 3 means and $df_{error} = 11$ are $q_{.05} = 3.82$ and $q_{.01} = 5.15$. Since the
obtained value $q = 10.12$ is greater than the aforementioned tabled critical two-tailed values, the
nondirectional alternative hypothesis $H_1$: $\mu_1 \neq \mu_2$ is supported at both the .05 and .01 levels.
Note that when the unadjusted means of Groups 1 and 2 are evaluated with Tukey's HSD test in
Section VI, the alternative hypothesis is only supported at the .05 level.

When Equation 21.31 is employed to compute the minimum required difference ($CD_{HSD}$)
in order for two means to differ significantly from one another at a prespecified level of
significance, the $CD_{HSD}$ value computed for the adjusted means is less than the value computed in
Section VI for the unadjusted means. It is demonstrated below that the value $CD_{HSD} = 1.25$
computed at the .05 level for the adjusted means is less than the value $CD_{HSD} = 2.32$ computed
in Section VI for the unadjusted means. It should be emphasized, however, that employing a
covariate in and of itself does not guarantee a more sensitive analysis. The increased sensitivity
of the analysis in the case of Example 21.5 is the result of the covariate having a substantial
linear relationship with the dependent variable.


$$CD_{HSD} = q_{.05}\sqrt{\frac{MS_{error}}{n}} = (3.82)\sqrt{\frac{.535}{5}} = 1.25$$
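
The Tukey computations on the adjusted means can be sketched in Python as follows. The variable names are illustrative assumptions; the adjusted means, the error term .535, n = 5, and the tabled value 3.82 are taken from the text. Because the sketch carries the square root at full precision, the q value it produces can differ from the reported 10.12 in the second decimal place.

```python
import math

# Values taken from the worked example; names are illustrative only.
adjusted_mean_1 = 9.672
adjusted_mean_2 = 6.364
ms_error = 0.535      # error term reported for Equation 21.71
n = 5                 # subjects per group
q_crit_05 = 3.82      # tabled critical q for k = 3 means and df = 11
q_crit_01 = 5.15

standard_error = math.sqrt(ms_error / n)

# Equation 21.30: Studentized range statistic for the two adjusted means
q = (adjusted_mean_1 - adjusted_mean_2) / standard_error

# Equation 21.31: minimum required difference at the .05 level
cd_hsd = q_crit_05 * standard_error

print(round(q, 2), round(cd_hsd, 2), q > q_crit_01)
```

Comparing the observed difference between two adjusted means with `cd_hsd` leads to the same decision as comparing `q` with the tabled critical value.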


Additional procedures that can be employed for conducting comparisons for an analysis of 
covariance can be found in Kirk (1995) and Winer et al. (1991). 


Evaluation of the homogeneity of regression assumption The homogeneity of regression 
assumption of the analysis of covariance is that within each of the k groups there is a linear 
correlation between the dependent variable and the covariate, and that the k group regression 
lines have the same slope (i.e., are parallel to one another). A full discussion of the concept of 
linear regression and the procedure for determining the slope of a regression line can be found 
in Section VI of the Pearson product-moment correlation coefficient. The reader who is un- 
familiar with the concept of regression is advised to review the latter discussion before 
continuing this section. 

In order to evaluate the homogeneity of regression assumption, the following null and
alternative hypotheses (which are evaluated with Equation 21.74) are employed. (The notation $\beta_j$
in the null hypothesis represents the slope of the regression line of Y on X for the $j^{th}$ group.)


Null hypothesis                $H_0$: $\beta_1 = \beta_2 = \beta_3$


(In the underlying populations represented by the three groups, the slope of the regression line
of Y on X for Group 1 equals the slope of the regression line of Y on X for Group 2 equals the
slope of the regression line of Y on X for Group 3.)


Alternative hypothesis         $H_1$: Not $H_0$


(This indicates there is a difference between the slopes of at least two of the k = 3 regression 
lines. In order to reject the null hypothesis, the F value computed with Equation 21.74 must be 
equal to or greater than the tabled critical F value at the prespecified level of significance.) 


If the test of the homogeneity of regression assumption results in rejection of the null 
hypothesis, it means that the assumption is violated. Hinkle et al. (1998) and Tabachnick and 
Fidell (1996) note that if the null hypothesis for the homogeneity of regression assumption is 
rejected, it suggests that an interaction is present with regard to the relationship between the 
covariate and the dependent variable. The presence of an interaction means that the relationship 
between the covariate and the dependent variable is not consistent across the different treatments/ 
groups. Consequently, a different adjustment will be required for each of the treatment means 
— in other words, Equation 21.68 cannot be employed to compute the adjusted means for each 
of the groups. Although Keppel (1991, p. 317) cites studies which suggest that the analysis of 
covariance is reasonably robust with respect to violation of the homogeneity of regression 
assumption, other sources argue that violation of the assumption can seriously compromise the
reliability of the analysis of covariance. 

Equations 21.73 and 21.74 are employed to evaluate the homogeneity of regression 
assumption. Through use of Equation 21.73, a within-groups regression sum of squares 
(which is explained in Endnote 73) is computed. The latter value is then employed in Equation 
21.74 to compute the test statistic. 


$$SS_{wgreg} = \sum_{j=1}^{k}\left[SS_{Y_j}\left(1 - r_{X_jY_j}^2\right)\right] \qquad \text{(Equation 21.73)}$$


Where: $SS_{Y_j} = \sum Y_j^2 - [(\sum Y_j)^2/n_j]$. The latter value indicates that the sum of squares for the
       Y variable (the dependent variable) is computed for each group.
       $r_{X_jY_j}^2$ is the square of the correlation between the X variable (the covariate) and the Y
       variable (the dependent variable) within the $j^{th}$ group.


The notation in Equation 21.73 indicates that for each group, the square of the correlation
between the covariate and dependent variable is subtracted from 1, and the difference is
multiplied by the sum of squares for the Y variable within that group. The value $SS_{wgreg}$ is the sum of
the values computed for the k groups.

The values $r_{X_1Y_1} = 1$, $r_{X_2Y_2} = .843$, and $r_{X_3Y_3} = .861$ are computed for each group by
employing in Equation 28.1 the appropriate summary values for the five X and Y scores for that
group, or through use of the equation $r = SP_{XY}/\sqrt{SS_X SS_Y}$ (which is another way of expressing
Equation 28.1). Employing the relevant summary values in Table 21.11, the three $r_{X_jY_j}$ values
are computed:


$$r_{X_1Y_1} = SP_{X_1Y_1}/\sqrt{SS_{X_1}SS_{Y_1}} = 2.8/\sqrt{(2.8)(2.8)} = 1$$
$$r_{X_2Y_2} = SP_{X_2Y_2}/\sqrt{SS_{X_2}SS_{Y_2}} = 5.6/\sqrt{(4.8)(9.2)} = .843$$
$$r_{X_3Y_3} = SP_{X_3Y_3}/\sqrt{SS_{X_3}SS_{Y_3}} = 6.2/\sqrt{(4.8)(10.8)} = .861$$


Employing Equation 21.73, the value $SS_{wgreg} = 5.46$ is computed below.

$$SS_{wgreg} = \left[(2.8)\left(1 - (1)^2\right)\right] + \left[(9.2)\left(1 - (.843)^2\right)\right] + \left[(10.8)\left(1 - (.861)^2\right)\right] = 5.46$$


Equation 21.74 is employed to compute the test statistic for the homogeneity of regression
assumption. Two forms of the latter equation are presented below.

$$F = \frac{(SS_{WG(adj)} - SS_{wgreg})/(k - 1)}{SS_{wgreg}/[k(n_j - 2)]} = \frac{k(n_j - 2)(SS_{WG(adj)} - SS_{wgreg})}{(k - 1)SS_{wgreg}} \qquad \text{(Equation 21.74)}$$

Employing Equation 21.74, the value $F = .124$ is computed.

$$F = \frac{(5.61 - 5.46)/(3 - 1)}{5.46/[(3)(5 - 2)]} = \frac{(3)(5 - 2)(5.61 - 5.46)}{(3 - 1)(5.46)} = .124$$


The obtained value $F = .124$ is evaluated with Table A10, employing as the numerator
and denominator degrees of freedom $df_{num} = k - 1 = 2$ and $df_{den} = k(n_j - 2) = 9$. For
$df_{num} = 2$ and $df_{den} = 9$, the tabled $F_{.05}$ and $F_{.01}$ values are $F_{.05} = 4.26$ and $F_{.01} = 8.02$. In
order to reject the null hypothesis, the obtained F value must be equal to or greater than the
tabled critical value at the prespecified level of significance. Since $F = .124$ is less than
$F_{.05} = 4.26$, the null hypothesis cannot be rejected. Thus, we can conclude that the three
regression lines are homogeneous.
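
The test of the homogeneity of regression assumption can be sketched in Python as follows. The variable names are illustrative assumptions; the within-group summary values (sum of products, sum of squares for X, sum of squares for Y) are those used in the three correlation computations above, and $SS_{WG(adj)} = 5.61$ is taken from the analysis of covariance. The sketch does not round intermediate values, so its F ratio is approximately .125 rather than the reported .124.

```python
import math

# (SP_XY, SS_X, SS_Y) for each group, as used in the r computations above.
groups = [
    (2.8, 2.8, 2.8),
    (5.6, 4.8, 9.2),
    (6.2, 4.8, 10.8),
]
k = len(groups)        # number of groups
n_j = 5                # subjects per group
ss_wg_adj = 5.61       # adjusted within-groups sum of squares

# Within-group correlation between covariate and dependent variable
r_values = [sp / math.sqrt(ss_x * ss_y) for sp, ss_x, ss_y in groups]

# Equation 21.73: within-groups regression sum of squares
ss_wgreg = sum(ss_y * (1 - r ** 2) for (_, _, ss_y), r in zip(groups, r_values))

# Equation 21.74: F ratio with df_num = k - 1 and df_den = k(n_j - 2)
df_num = k - 1
df_den = k * (n_j - 2)
f_ratio = ((ss_wg_adj - ss_wgreg) / df_num) / (ss_wgreg / df_den)

print([round(r, 3) for r in r_values], round(ss_wgreg, 2), round(f_ratio, 3))
```

An F ratio this far below the tabled critical value of 4.26 leads to the same conclusion as the text: the hypothesis of equal slopes is retained.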


Estimating the magnitude of treatment effect for the single-factor between-subjects analysis 
of covariance Values for the omega squared and eta squared statistics and Cohen's f index 
computed in Section VI can also be computed for the single-factor between-subjects analysis 
of covariance. The appropriate adjusted values from the analysis of covariance can be sub- 
stituted in any of the Equations 21.41-21.47. Through use of Equations 21.41, 21.43, and 21.45, 
the values of omega squared, eta squared, and Cohen's f index are computed for the analysis 
of covariance. Notice that the computed values $\hat{\omega}^2 = .84$, $\hat{\eta}^2 = .87$, and $f = 2.29$ are
substantially larger than the analogous values computed for the magnitude of the treatment effect for
the analysis of variance. The latter reflects the fact that when the data are adjusted for the effects 
of the covariate, a substantially larger proportion of the variability on the dependent variable is 
associated with the independent variable. 





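$$\hat{\omega}^2_{adj} = \frac{SS_{BG(adj)} - (k - 1)MS_{WG(adj)}}{SS_{T(adj)} + MS_{WG(adj)}} = \frac{37.76 - (3 - 1)(.51)}{43.37 + .51} = .84$$

$$\hat{\eta}^2_{adj} = \frac{SS_{BG(adj)}}{SS_{T(adj)}} = \frac{37.76}{43.37} = .87$$

$$f_{adj} = \sqrt{\frac{\hat{\omega}^2_{adj}}{1 - \hat{\omega}^2_{adj}}} = \sqrt{\frac{.84}{1 - .84}} = 2.29$$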
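
The three measures of magnitude of treatment effect can be checked with the Python sketch below. The variable names are illustrative assumptions, and the adjusted sums of squares and mean square are those reported for the analysis of covariance. The sketch assumes that Cohen's f index is obtained from omega squared rounded to two decimal places, which is the computation that reproduces the reported value 2.29.

```python
import math

# Adjusted values from the analysis of covariance; names are illustrative.
ss_bg_adj = 37.76     # adjusted between-groups sum of squares
ss_t_adj = 43.37      # adjusted total sum of squares
ms_wg_adj = 0.51      # adjusted within-groups mean square
k = 3                 # number of groups

# Omega squared for the adjusted data
omega_sq = (ss_bg_adj - (k - 1) * ms_wg_adj) / (ss_t_adj + ms_wg_adj)

# Eta squared for the adjusted data
eta_sq = ss_bg_adj / ss_t_adj

# Cohen's f index; assumption: computed from omega squared rounded to .84
omega_sq_rounded = round(omega_sq, 2)
cohen_f = math.sqrt(omega_sq_rounded / (1 - omega_sq_rounded))

print(round(omega_sq, 2), round(eta_sq, 2), round(cohen_f, 2))
```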


Computation of the power of the single-factor between-subjects analysis of covariance
Equation 21.38, which is employed to compute the power of the single-factor between-subjects
analysis of variance, can also be used to compute the power of the single-factor
between-subjects analysis of covariance. The value $\sigma^2_{WG(adj)}$ is employed in Equation 21.38 in place of
$\sigma^2_{WG}$. If prior data are available from previous studies, the value of $MS_{WG(adj)}$ in such studies can
be used as an estimate of $\sigma^2_{WG(adj)}$. Cohen (1977, 1988) and Keppel (1991) discuss power
computations for the analysis of covariance in more detail.


Final remarks on the analysis of covariance In closing the discussion of analysis of 
covariance, it should be emphasized that the procedure should not be used indiscriminately, since 
a considerable investment of time and effort is required to obtain subjects’ scores on a covariate, 
as well as the fact that the analysis requires laborious computations (although admittedly, the 
latter will not be an issue if one has access to the appropriate computer software). Kirk (1995, 
p. 710) suggests the following with regard to determining the appropriateness of employing an 
analysis of covariance: a) If a researcher is aware of one or more extraneous variables which 
cannot be controlled experimentally that may influence the dependent variable, they can be 
employed as covariates; and b) The measures obtained for the covariate(s) should not be
influenced by the experimental treatments under study. The latter condition will be met if any
of the following is true: 1) The measures of the covariate are obtained prior to introducing the
experimental treatment; 2) The measures of the covariate are obtained after the introduction of
the experimental treatment, but before the treatment can have an impact on the covariate; or
3) Based on prior information, the researcher can assume the covariate is unaffected by the
experimental treatment. In the last case, the covariate can be measured after the administration
of the experimental treatment.

Winer et al. (1991, pp. 787-788) discuss the advantages and disadvantages of employing
an analysis of covariance versus the use of a factorial design in which the covariate is employed
as a second independent variable. Factorial designs are discussed in detail in the discussion of
the between-subjects factorial analysis of variance (Test 27). Winer et al. (1991) note that one
advantage of employing a factorial design (which is evaluated with the appropriate factorial
analysis of variance) is that its assumptions are less restrictive than the assumptions for an
analysis of covariance. Disadvantages associated with a factorial design are that small and/or
unequal sample sizes may be present within the different levels of the factors, unless the total size 
of the sample is relatively large. Keppel (1991, p. 326) and Kirk (1995, p. 739) discuss the pros 
and cons of employing stratification/blocking, and then conducting the appropriate analysis of 
variance as an alternative to the analysis of covariance. Stratification involves employing the co- 
variate to form homogeneous blocks of subjects, in order to reduce the measure of error 
variability employed in computing the F ratio. The subject of blocking is discussed in Section 
I of the single-factor within-subjects analysis of variance (Test 24). 

For a more detailed discussion of analysis of covariance the reader should consult such
sources as Hinkle et al. (1998), Keppel (1991), Keppel and Zedeck (1989), Kirk (1995),
Marascuilo and Levin (1983), Marascuilo and Serlin (1988), Maxwell and Delaney (1990),
Myers and Well (1995), Tabachnick and Fidell (1996), and Winer et al. (1991).


Endnotes


1. The term single-factor refers to the fact that the design upon which the analysis of variance 
is based involves a single independent variable. Since factor and independent variable 
mean the same thing, multifactor designs (more commonly called factorial designs) that 
are evaluated with the analysis of variance involve more than one independent variable. 
Multifactor analysis of variance procedures are discussed under the between-subjects 
factorial analysis of variance. 


2. It should be noted that if an experiment is confounded, one cannot conclude that a 
significant portion of between-groups variability is attributed to the independent variable. 
This is the case, since if one or more confounding variables systematically vary with the 
levels of the independent variable, a significant difference can be due to a confounding 
variable rather than the independent variable. 


3. The homogeneity of variance assumption is also discussed in Section VI of the t test for
two independent samples in reference to a design involving two independent samples.


4. Although it is possible to conduct a directional analysis, such an analysis will not be
described with respect to the analysis of variance. A discussion of a directional analysis
when k = 2 can be found under the t test for two independent samples. In addition, a
discussion of one-tailed F values can be found in Section VI of the latter test under the
discussion of Hartley's $F_{max}$ test for homogeneity of variance/F test for two
population variances. A discussion of the evaluation of a directional alternative hypothesis
when k ≥ 3 can be found in Section VII of the chi-square goodness-of-fit test (Test 8).
Although the latter discussion is in reference to analysis of a k independent samples design
involving categorical data, the general principles regarding the analysis of a directional
alternative hypothesis when k ≥ 3 are applicable to the analysis of variance.


5. Some sources present an alternative method for computing $SS_{BG}$ when the number of
subjects in each group is not equal. Whereas Equation 21.3 weighs each group's
contribution based on the number of subjects in the group, the alternative method (which is not
generally recommended) weighs each group's contribution equally, irrespective of sample
size. Keppel (1991), who describes the latter method, notes that as a general rule the value
it computes for $SS_{BG}$ is close to the value obtained with Equation 21.3, except when the
sample sizes of the groups differ substantially from one another.


6. Since there are an equal number of subjects in each group, the equation for $SS_{BG}$ can also
be written as follows:

$$SS_{BG} = \frac{(46)^2 + (33)^2 + (31)^2}{5} - \frac{(110)^2}{15} = 26.53$$


7. $SS_{WG}$ can also be computed with the following equation:

$$SS_{WG} = \sum X_T^2 - \sum_{j=1}^{k}\left[\frac{(\sum X_j)^2}{n_j}\right] = 856 - \left[\frac{(46)^2}{5} + \frac{(33)^2}{5} + \frac{(31)^2}{5}\right] = 22.80$$

Since there are an equal number of subjects in each group, n can be used in place of
$n_j$ in the above equation, as well as in Equation 21.5. The numerators of the terms in the
brackets can be combined and written over a single denominator that equals the value n = 5.
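
The alternative computational forms described in Endnotes 6 and 7 can be verified with the Python sketch below. The variable names are illustrative assumptions; the group sums 46, 33, and 31, the sum of the squared scores 856, and n = 5 are the values that appear in the two equations above.

```python
# Group sums and the sum of squared scores from the equations above;
# the variable names are illustrative only.
group_sums = [46, 33, 31]
n = 5                       # subjects per group (equal sample sizes)
sum_sq_scores = 856         # sum of all squared scores
N = n * len(group_sums)     # total number of subjects

grand_sum = sum(group_sums)

# Term shared by both equations: squared group sums divided by n, then summed
between_term = sum(s ** 2 / n for s in group_sums)

ss_bg = between_term - grand_sum ** 2 / N   # between-groups sum of squares
ss_wg = sum_sq_scores - between_term        # within-groups sum of squares

print(round(ss_bg, 2), round(ss_wg, 2))
```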


8. When $n_1 = n_2 = n_3$, $MS_{WG} = (s_1^2 + s_2^2 + s_3^2)/k$. Thus, since $s_1^2 = .7$, $s_2^2 = 2.3$, and
$s_3^2 = 2.7$, $MS_{WG} = (.7 + 2.3 + 2.7)/3 = 1.9$.


9. Equation 21.9 can be employed if there are an equal or unequal number of subjects in each
group. The following equation can also be employed when the number of subjects in each
group is equal or unequal: $df_{WG} = (n_1 - 1) + (n_2 - 1) + \cdots + (n_k - 1)$. The equation
$df_{WG} = k(n - 1) = nk - k$ can be employed to compute $df_{WG}$, but only when the
number of subjects in each group is equal.


10. When there are an equal number of subjects in each group, since N = nk, $df_T = nk - 1$.

11. There is a separate F distribution for each combination of $df_{num}$ and $df_{den}$ values. Figure
21.2 depicts the F distribution for three different sets of degrees of freedom values. Note
that in each of the distributions, 5% of the distribution falls to the right of the tabled critical
$F_{.05}$ value.


Most tables of the F distribution do not include tabled critical values for all possible
values of $df_{num}$ and $df_{den}$. The protocol that is generally employed for determining a critical
F value for a df value that is not listed is to either employ interpolation or to employ the df
value closest to the desired df value. Some sources qualify the latter by stating that in order
to ensure that the Type I error rate does not exceed the prespecified value of alpha, one
should employ the df value that is closest to but not above the desired df value.



Figure 21.2 Representative F Distributions for Different df Values (curves labeled by df,
including df = 2, 20 and df = 8, 20, each with its tabled critical $F_{.05}$ value marked, including
2.87 and 2.45)


12. Although the discussion of comparison procedures in this section will be limited to the
analysis of variance, the underlying general philosophy can be generalized to any inferential
statistical test for which comparisons are conducted.


13. The terms family and set are employed synonymously throughout this discussion.


14. The accuracy of Equation 21.13 will be compromised if all of the comparisons are not
independent of one another. Independent comparisons, which are commonly referred to 
as orthogonal comparisons, are discussed later in the section. 


15. Equation 21.14 tends to overestimate the value of $\alpha_{FW}$. The degree to which it
overestimates $\alpha_{FW}$ increases as either the value of c or $\alpha_{PC}$ increases. For larger values of c
and $\alpha_{PC}$, Equation 21.14 is not very accurate.

16. Equation 21.16 provides a computationally quick approximation of the value computed
with Equation 21.15. The value computed with Equation 21.16 tends to underestimate the
value of $\alpha_{PC}$. The larger the value of $\alpha_{FW}$ or c, the greater the degree $\alpha_{PC}$ will be
underestimated with the latter equation. When $\alpha_{FW} = .05$, however, the two equations yield
values that are almost identical.


17. This example is based on a discussion of this issue in Howell (1992, 1997) and Maxwell
and Delaney (1990). 

18. The term single degree of freedom comparison reflects the fact that k = 2 means are
contrasted with one another. Although one or both of the k = 2 means may be a composite
mean that is based on the combined scores of two or more groups, any composite mean is
expressed as a single mean value. The latter is reflected in the fact that there will always
be one equals sign (=) in the null hypothesis for a single degree of freedom comparison.


19. Although the examples illustrating comparisons will assume a nondirectional alternative
hypothesis, the alternative hypothesis can also be stated directionally. When the alternative
hypothesis is stated directionally, the tabled critical one-tailed F value must be employed.
Specifically, when using the F distribution in evaluating a directional alternative
hypothesis, when $\alpha_{PC} = .05$, the tabled $F_{.90}$ value is employed for the one-tailed $F_{.05}$
value instead of the tabled $F_{.95}$ value (which as noted earlier is employed for the
two-tailed/nondirectional $F_{.05}$ value). When $\alpha_{PC} = .01$, the tabled $F_{.98}$ value is employed for
the one-tailed $F_{.01}$ value instead of the tabled $F_{.99}$ value (which is employed for the
two-tailed/nondirectional $F_{.01}$ value).


20. If the coefficients of Groups 1 and 2 are reversed (i.e., $c_1 = -1$ and $c_2 = +1$), the value
of $\sum (c_j)(\bar{X}_j)$ will equal -2.6. The fact that the sign of the latter value is negative will not
affect the test statistic for the comparison. This is the case, since, in computing the F value
for the comparison, the value $\sum (c_j)(\bar{X}_j)$ is squared and, consequently, it becomes
irrelevant whether $\sum (c_j)(\bar{X}_j)$ is a positive or negative number.


21. When the sample sizes of all k groups are not equal, the value of the harmonic mean
(which is discussed in Section VI of the t test for two independent samples) is employed
to represent n in Equation 21.17. However, when the harmonic mean is employed, if there
are large discrepancies between the sizes of the samples, the accuracy of the analysis may
be compromised.


22. $MS_{WG}$ is employed as the estimate of error variability for the comparison, since, if the
homogeneity of variance assumption is not violated, the pooled within-groups variability 
employed in computing the omnibus F value will provide the most accurate estimate of 
error variability. 


23. As is the case with simple comparisons, the alternative hypothesis for a complex planned
comparison can also be evaluated nondirectionally. 


24. A reciprocal of a number is the value 1 divided by that number.


25. When there are an equal number of subjects in each group it is possible, though rarely done,
to assign different coefficients to two or more groups on the same side of the equals sign.
In such an instance the composite mean reflects an unequal weighting of the groups. On
the other hand, when there are an unequal number of subjects in any of the groups on the
same side of the equals sign, any groups that do not have the same sample size will be
assigned a different coefficient. The coefficient a group is assigned will reflect the
proportion it contributes to the total number of subjects on that side of the equals sign.
Thus, if Groups 1 and 2 are compared with Group 3, and there are 4 subjects in Group 1
and 6 subjects in Group 2, there are a total of 10 subjects involved on that side of the equals
sign. The absolute value of the coefficient assigned to Group 1 will be 4/10 = 2/5, whereas the
absolute value assigned to Group 2 will be 6/10 = 3/5.
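
The proportional weighting described in Endnote 25 can be illustrated with a brief Python sketch. The dictionary and variable names are illustrative assumptions; the sample sizes 4 and 6 are those of the example in the endnote.

```python
from fractions import Fraction

# Sample sizes of the groups on the same side of the equals sign
# (names are illustrative; sizes are those used in the endnote).
group_sizes = {"Group 1": 4, "Group 2": 6}

total = sum(group_sizes.values())

# Absolute value of each coefficient: the group's share of the subjects
weights = {group: Fraction(size, total) for group, size in group_sizes.items()}

for group, weight in weights.items():
    print(group, weight)
```

The absolute values of the coefficients on one side of the equals sign sum to 1, so the composite mean is a weighted average of the group means.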

26. When any of the coefficients are fractions, in order to simplify calculations some sources
prefer to convert the coefficients into integers. In order to do this, each coefficient must be
multiplied by a least common denominator. A least common denominator is the smallest
number (excluding 1) that is divisible by all of the denominators of the coefficients. With
respect to the complex comparison under discussion, the least common denominator is 2,
since 2 is the smallest number that can be divided by 1 and 2 (which are the denominators
of the coefficients 1/1 = 1 and 1/2). If all of the coefficients are multiplied by 2, the
coefficients are converted into the following values: $c_1 = -1$, $c_2 = -1$, $c_3 = +2$. If the
latter coefficients are employed in the calculations that follow, they will produce the same
end result as that obtained through use of the coefficients employed in Table 21.4. It should
be noted, however, that if the converted coefficients are employed, the value $\sum (c_j)(\bar{X}_j)$
= -1.7 in Table 21.4 will become twice the value that is presently listed (i.e., it becomes
-3.4). As a result of this, $\sum (c_j)(\bar{X}_j)$ will no longer represent the difference between the two
sets of means contrasted in the null hypothesis. Instead, it will be a multiple of that value,
specifically, the multiple of the value by which the coefficients are multiplied.


27. If $n_a \neq n_b$, $\sqrt{(MS_{WG}/n_a) + (MS_{WG}/n_b)}$ is employed as the denominator of Equation 21.22
and $\sqrt{[(\sum c_a^2)(MS_{WG})/n_a] + [(\sum c_b^2)(MS_{WG})/n_b]}$ is employed as the denominator of
Equation 21.23.





28. If the value $\sqrt{(s_a^2/n_a) + (s_b^2/n_b)}$ is employed as the denominator of Equations 21.22/21.23,
or if the degrees of freedom for Equations 21.22/21.23 are computed with Equation 11.4
($df = n_a + n_b - 2$), a different result will be obtained since: a) Unless the variance for
all of the groups is equal, it is unlikely that the computed t value will be identical to the one
computed with Equations 21.22/21.23; and b) If $df = n_a + n_b - 2$ is used, the tabled
critical t value employed will be higher than the tabled critical value employed for
Equations 21.22/21.23. This is the case, since the df value associated with $MS_{WG}$ is larger
than the df value associated with $df = n_a + n_b - 2$. The larger the value of df, the lower
the corresponding tabled critical t value.


29. Equations 21.24 and 21.25 are, respectively, derived from Equations 21.22 and 21.23.
When two groups are compared with one another, the data may be evaluated with either
the t distribution or the F distribution. The relationship between the two distributions is
$F = t^2$. In view of this, the term $\sqrt{F_{(1,WG)}}$ in Equations 21.24 and 21.25 can also be
written as $t_{df_{WG}}$. In other words, Equation 21.24 can be written as
$CD_{LSD} = t_{df_{WG}}\sqrt{2MS_{WG}/n}$ (and, in the case of Equation 21.25, the same equation is
employed, except for the fact that, inside the radical, $(\sum c_j^2)$ is employed in place of 2).
Thus, in the computations to follow, if a nondirectional alternative hypothesis is employed
with $\alpha = .05$, one can employ either $F_{.05} = 4.75$ (for $df_{num} = 1$, $df_{den} = 12$) or $t_{.05} = 2.18$
(for df = 12), since $(t_{.05} = 2.18)^2 = (F_{.05} = 4.75)$.


30. As is noted in reference to linear contrasts, the df = 1 value for the numerator of the F ratio
for a single degree of freedom comparison is based on the fact that the comparison involves
k = 2 groups with df = k - 1. In the same respect, in the case of complex comparisons
there are two sets of means, and thus $df_{comp} = 2 - 1 = 1$.


31. As indicated in Endnote 29, if a t value is substituted in Equation 21.24 it will yield the
same result. Thus, if $t_{.05} = 2.18$ is employed in the equation $CD_{LSD} = t_{df_{WG}}\sqrt{2MS_{WG}/n}$,
$CD_{LSD} = 2.18\sqrt{[(2)(1.9)]/5} = 1.90$.


32. The absolute value of t is employed, since a nondirectional alternative hypothesis is
evaluated. 


33. When $n_a \neq n_b$, $\sqrt{(MS_{WG}/n_a) + (MS_{WG}/n_b)}$ is employed in Equation 21.26 in place of
$\sqrt{2MS_{WG}/n}$.


34. Since, in a two-tailed analysis we are interested in a proportion that corresponds to the most
extreme .0167 cases in the distribution, one-half of the cases (i.e., .0083) falls in each tail
of the distribution.


35. The only exception to this will be when one comparison is being made (i.e., c = 1), in which
case both methods yield the identical CD value. 


36. Some sources employ the abbreviation WSD for wholly significant difference instead of
HSD. 


37. The value of $\alpha_{FW}$ for Tukey's HSD test is compared with the value of $\alpha_{FW}$ for other
comparison procedures within the framework of the discussion of the Scheffé test later in
this section.


38. When the tabled critical q value for k = 2 treatments is employed, the Studentized range
statistic will produce equivalent results to those obtained with multiple t tests/Fisher's
LSD test. This will be demonstrated later in reference to the Newman-Keuls test, which
is another comparison procedure that employs the Studentized range statistic.


39. As is the case with a t value, the sign of q will only be relevant if a directional alternative
hypothesis is evaluated. When a directional alternative hypothesis is evaluated, in order to
reject the null hypothesis, the sign of the computed q value must be in the predicted
direction, and the absolute value of q must be equal to or greater than the tabled critical q
value at the prespecified level of significance.


40. Although Equations 21.30 and 21.31 can be employed for both simple and complex
comparisons, sources generally agree that Tukey's HSD test should only be employed for 
simple comparisons. Its use with only simple comparisons is based on the fact that in the 
case of complex comparisons it provides an even less powerful test of an alternative 
hypothesis than does the Scheffé test (which is an extremely conservative procedure) 
discussed later in this section. 


41. When $n_a \neq n_b$ and/or the homogeneity of variance assumption is violated, some sources
recommend using the following modified form of Equation 21.31 (referred to as the
Tukey-Kramer procedure) for computing $CD_{HSD}$: $CD_{HSD} = q_{(k,\,df_{WG})}\sqrt{MS_{WG}[(1/n_a) + (1/n_b)]/2}$.
Other sources, however, do not endorse the use of the latter equation and provide
alternative approaches for dealing with unequal sample sizes.

In the case of unequal sample sizes, some sources (e.g., Winer et al., 1991)
recommend employing the harmonic mean of the sample sizes to compute the value of n
for Equations 21.30 and 21.31 (especially if the sample sizes are approximately the same
value). The harmonic mean is described in Section VI of the t test for two independent
samples.


42. The only comparison for which the minimum required difference computed for the
Newman-Keuls test will equal $CD_{HSD}$ will be the comparison contrasting the smallest and
largest means in the set of k means. One exception to this is a case in which two or more 
treatments have the identical mean value, and that value is either the lowest or highest of 
the treatment means. Although the author has not seen such a configuration discussed in 
the literature, one can argue that in such an instance the number of steps between the lowest 
and highest mean should be some value less than k. One can conceptualize all cases of a 
tie as constituting one step, or perhaps, for those means that are tied, one might employ the 
average of the steps that would be involved if all those means are counted as separate steps. 
The larger the step value that is employed, the more conservative the test. 


43. If Equation 21.30 is employed, it uses the same values that are employed for Tukey's HSD
test, and thus yields the identical q value. Thus, $q = (9.2 - 6.6)/\sqrt{1.9/5} = 4.22$.


44. One exception to this involving the Bonferroni-Dunn test will be discussed later in this
section. 


45. Recollect that the Bonferroni-Dunn test assumed that a total of 6 comparisons are
conducted (3 simple comparisons and 3 complex comparisons). Thus, as noted earlier, since
the number of comparisons exceeds [k(k - 1)]/2 = [3(3 - 1)]/2 = 3, the Scheffé test will
provide a more powerful test of an alternative hypothesis than the Bonferroni-Dunn test.


46. The author is indebted to Scott Maxwell for his input on the content of the discussion to
follow. 


47. Dunnett (1964) developed a modified test procedure (described in Winer et al. (1991)) to
be employed in the event that there is a lack of homogeneity of variance between the
variance of the control group and the variance of the experimental groups with which it is
contrasted.


48. The philosophy to be presented here can be generalized to any inferential statistical analysis.


49. Although the difference between $\alpha_{FW} = .05$ and $\alpha_{FW} = .10$ may seem trivial, a result
that is declared significant is more likely to be submitted and/or accepted for publication.


50. The reader may find it useful to review the discussion of effect size in Section VI of the
single sample t test and the t test for two independent samples. Effect size indices are
also discussed in Section IX (the Addendum) of the Pearson product-moment
correlation coefficient under the discussion of meta-analysis and related topics.


The reader may want to review the discussion of confidence intervals in Section VI of both 
the single-sample t test and the t test for two independent samples. 


The F test for two population variances (discussed in conjunction with the $F_{max}$ test as 
Test 11a in Section VI of the t test for two independent samples) can only be employed 
to evaluate the homogeneity of variance hypothesis when k = 2. 


$\phi$ should not be confused with the phi coefficient (also represented by the notation $\phi$), 
discussed under the chi-square test for r x c tables (Test 16). Although the latter measure 
is identified by the same symbol, it represents a different measure than the noncentrality 
parameter. 


If a power analysis is conducted after the analysis of variance, the value computed for 
$MS_{WG}$ can be employed to represent $\sigma^2_{WG}$ if the researcher has reason to believe it is a 
reliable estimate of error/within-groups variability in the underlying population. If prior to 
conducting a study reliable data are available from other studies, the $MS_{WG}$ values derived 
in the latter studies can be used as a basis for estimating $\sigma^2_{WG}$. 


When there is no curve that corresponds exactly to $df_{WG}$, one can either interpolate or 
employ the $df_{WG}$ value closest to it. 


Tiku (1967) has derived more detailed power tables that include the alpha values .025 and 
.005. 


When k = 2, Equations 21.41/21.42 and Equation 11.14 will yield the same $\hat{\omega}^2$ value. 


The following should be noted with respect to the eta squared statistic: a) Earlier in this 
section it was noted that Cohen (1977; 1988, pp. 284-287) employs the values .0099, .0588, 
and .1379 as the lower limits for defining a small versus medium versus large effect size for 
the omega squared statistic. In actuality, Cohen (1977; 1988, pp. 284-287) employs the 
notation for eta squared (i.e., $\eta^2$) in reference to the aforementioned effect size values. 
However, the definition Cohen (1977; 1988, p. 281) provides for eta squared is identical 
to Equation 21.40 (which is the equation for omega squared). For the latter reason, various 
sources (e.g., Keppel (1991) and Kirk (1995)) employ the values .0099, .0588, and .1379 in 
reference to the omega squared statistic; b) Equation 21.44 (the equation for Adjusted $\tilde{\eta}^2$) 
is essentially equivalent to Equation 21.40 (the definitional equation for the population 
parameter omega squared); c) Some sources employ the notation $R^2$ for the eta squared 
statistic. The statistic represented by $R^2$ or $\tilde{\eta}^2$ is commonly referred to as the correlation 
ratio, which is the squared multiple correlation coefficient that is computed when multiple 
regression (discussed under the multiple correlation coefficient (Test 28k)) is employed 
to predict subjects’ scores on the dependent variable based on group membership. The use 
of $R^2$ in this context reflects the fact that the analysis of variance can be conceptualized 
within the framework of a multiple regression model; d) When k = 2, the eta squared 
statistic is equivalent to $r_{pb}^2$, the square of the point-biserial correlation coefficient 
(Test 28h). Under the discussion of $r_{pb}$ the equivalency of $\tilde{\eta}^2$ and $r_{pb}^2$ is demonstrated. 


A number of different measures have been developed to determine the magnitude of 
treatment effect for comparison procedures. Unfortunately, researchers do not agree with 
respect to which of the available measures is most appropriate to employ. Keppel (1991) 
provides a comprehensive discussion of this subject. 


Some sources employ $\sqrt{F_{(1,\,df_{WG})}}$ in Equation 21.48 instead of $t_{df_{WG}}$. Since $t = \sqrt{F}$, the two 
values produce equivalent results. The tabled critical two-tailed t value at a prespecified 
level of significance will be equivalent to the square root of the tabled critical F value at the 
same level of significance for $df_{num} = 1$ and $df_{den} = df_{WG}$. 


In using Equation 2.7, sin is equivalent to sy . 
1 


In the interest of precision, Keppel and Zedeck (1989, p. 98) note that although when the 
null hypothesis is true, the median of the sampling distribution for the value of F equals 1, 
the mean of the sampling distribution of F is slightly above one. Specifically, the expected 
value of $F = df_{WG}/(df_{WG} - 2)$. It should also be noted that although it rarely occurs, it is 
possible that $MS_{WG} > MS_{BG}$. In such a case the value of F will be less than 1 and, 
obviously, if F < 1, the result cannot be significant. 

In employing double (or even more than two) summation signs such as $\sum_{j=1}^{k}\sum_{i=1}^{n}$, the 
mathematical operations specified are carried out beginning with the summation sign that 
is farthest to the right and continued sequentially with those operations specified by 
summation signs to the left. Specifically, if k = 3 and n = 5, the notation $\sum_{j=1}^{k}\sum_{i=1}^{n} X_{ij}$ 
indicates that the sum of the n scores in Group 1 is computed, after which the sum of the 
n scores in Group 2 is computed, after which the sum of the n scores in Group 3 is 
computed. The final result will be the sum of all the aforementioned values that have been 
computed. 


For each of the N = 15 subjects in Table 21.9, the following is true with respect to the 
contribution of a subject's score to the total variability in the data: 


$$(X_{ij} - \bar{X}_T) = (\bar{X}_j - \bar{X}_T) + (X_{ij} - \bar{X}_j)$$


Total deviation score = BG deviation score + WG deviation score 
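
The partition of a total deviation score can be checked numerically. The following is a minimal sketch in Python, under the assumption that the scores of Example 22.1 (stated in the text to be identical to those of Example 21.1) are an acceptable stand-in for the data of Table 21.9; the variable names are illustrative only.

```python
# Scores of Example 22.1 (assumed here to stand in for Table 21.9).
groups = {
    1: [8, 10, 9, 10, 9],
    2: [7, 8, 5, 8, 5],
    3: [4, 8, 7, 5, 7],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = sum(all_scores) / len(all_scores)

ss_total = ss_between = ss_within = 0.0
for scores in groups.values():
    group_mean = sum(scores) / len(scores)
    for x in scores:
        total_dev = x - grand_mean        # total deviation score
        bg_dev = group_mean - grand_mean  # between-groups deviation score
        wg_dev = x - group_mean           # within-groups deviation score
        # The partition holds for every individual score.
        assert abs(total_dev - (bg_dev + wg_dev)) < 1e-12
        ss_total += total_dev ** 2
        ss_between += bg_dev ** 2
        ss_within += wg_dev ** 2

print(round(ss_total, 2), round(ss_between, 2), round(ss_within, 2))
```

When the squared deviations are summed over all N = 15 subjects, the total sum of squares equals the sum of the between-groups and within-groups sums of squares, which is the additivity the deviation equation implies.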


In evaluating a directional alternative hypothesis, when k = 2 the tabled $F_{.90}$ and $F_{.98}$ 
values (for the appropriate degrees of freedom) are, respectively, employed as the 
one-tailed .05 and .01 values. Since the values for $F_{.90}$ and $F_{.98}$ are not listed in Table A10, 
the values $F_{.90} = 3.46$ and $F_{.98} = 8.41$ can be obtained by squaring the tabled critical 
one-tailed values $t_{.05} = 1.86$ and $t_{.01} = 2.90$, by employing more extensive tables of the 
F distribution available in other sources, or through interpolation. 


In the case of a factorial design it is also possible to have a mixed-effects model. In a 
mixed-effects model it is assumed that at least one of the independent variables is based 
on a fixed-effects model and that at least one is based on a random-effects model. 


Winer et al. (1991) note that the analysis of covariance was originally presented by Fisher 
(1932). 


The equations below are alternative equations that can be employed to compute the values 
$SS_{T(adj)}$ and $SS_{WG(adj)}$. The slight discrepancy in values is the result of rounding off error. 

$$SS_{T(adj)} = SS_{T(Y)}(1 - r_T^2) = 49.33[1 - (.347)^2] = 43.39$$

$$SS_{WG(adj)} = SS_{WG(Y)}(1 - r_{WG}^2) = 22.8[1 - (.868)^2] = 5.62$$


The computation of the values $r_T$ and $r_{WG}$ is described at a later point in the discussion 
of the analysis of covariance. 

Through use of Equation 28.1, the sample correlation between the covariate and the 
dependent variable is computed to be $r_T = .347$. In the computations below, N is 
employed to represent the total number of subjects (in lieu of n, which is employed in Equation 
28.1). The values $\Sigma X_T^2 = 3664$ and $\Sigma Y_T^2 = 856$ are the sums of the $\Sigma X_j^2$ and $\Sigma Y_j^2$ 
values for the k = 3 groups. 

$$r_T = \frac{\Sigma XY_T - \frac{(\Sigma X_T)(\Sigma Y_T)}{N}}{\sqrt{\left[\Sigma X_T^2 - \frac{(\Sigma X_T)^2}{N}\right]\left[\Sigma Y_T^2 - \frac{(\Sigma Y_T)^2}{N}\right]}} = \frac{1725 - \frac{(234)(110)}{15}}{\sqrt{\left[3664 - \frac{(234)^2}{15}\right]\left[856 - \frac{(110)^2}{15}\right]}} = .347$$


The value $r_T = .347$ can be computed through use of the equation $r = SP_{XY}/\sqrt{SS_X SS_Y}$ 
(which is discussed at the end of Section IV of the Pearson product-moment correlation 
coefficient). When the values $SP_{T(XY)} = 9$, $SS_{T(X)} = 13.6$, and $SS_{T(Y)} = 49.33$ (computed 
at the bottom of Table 21.11) are substituted in the latter equation, the resulting value is 
$r_T = 9/\sqrt{(13.6)(49.33)} = .347$. 
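
The arithmetic behind the total correlation and the adjusted total sum of squares can be sketched in a few lines of Python. The summary values are those quoted in the text; the sum of the Y scores (110) is not legible in this copy and is assumed here because it reproduces the stated value of 49.33 for the total sum of squares on Y. Variable names are illustrative.

```python
import math

# Summary values quoted in the text (N = 15 subjects).
N = 15
sum_x, sum_x2 = 234, 3664   # covariate: sum of scores, sum of squared scores
sum_y, sum_y2 = 110, 856    # dependent variable (sum_y = 110 is an assumption)
sum_xy = 1725               # sum of the cross products

sp_xy = sum_xy - sum_x * sum_y / N   # sum of products
ss_x = sum_x2 - sum_x ** 2 / N       # sum of squares for the covariate
ss_y = sum_y2 - sum_y ** 2 / N       # sum of squares for the dependent variable

r_t = sp_xy / math.sqrt(ss_x * ss_y)

# Adjusted total sum of squares, SS_T(Y) * (1 - r_T^2).
ss_t_adj = ss_y * (1 - r_t ** 2)

print(round(sp_xy, 2), round(ss_x, 2), round(ss_y, 2), round(r_t, 3))
```

Because the sketch carries the unrounded value of the correlation, the adjusted sum of squares it yields differs in the second decimal place from the value obtained with the rounded correlation, which is the rounding discrepancy the text mentions.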


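The equation noted below is an alternative way of computing $r_{WG}$ through use of the 
elements employed in Equation 28.1. The relevant values within each group are pooled in 
the numerator and denominator of the equation. 

$$r_{WG} = \frac{\left[702 - \frac{(76)(46)}{5}\right] + \left[527 - \frac{(79)(33)}{5}\right] + \left[496 - \frac{(79)(31)}{5}\right]}{\sqrt{\left(\left[1158 - \frac{(76)^2}{5}\right] + \left[1253 - \frac{(79)^2}{5}\right] + \left[1253 - \frac{(79)^2}{5}\right]\right)\left(\left[426 - \frac{(46)^2}{5}\right] + \left[227 - \frac{(33)^2}{5}\right] + \left[203 - \frac{(31)^2}{5}\right]\right)}} = .868$$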


If the homogeneity of regression assumption for the analysis of covariance is violated, the 
value computed for $b_{WG}$ will result in biased adjusted mean values (i.e., the adjusted mean 
values for the groups will not be accurate estimates of their true values in the underlying 
populations). Evaluation of the homogeneity of variance assumption is discussed later in 
this section. 


Equation 21.71 can also be written as follows: 

$$MS'_{WG(adj)} = MS_{WG(adj)} + MS_{WG(adj)}\left[\frac{MS_{BG(X)}}{SS_{WG(X)}}\right] = .51 + (.51)\left[\frac{.6}{12.4}\right] = .535$$

An alternative way to evaluate the homogeneity of regression assumption is presented by 
Keppel (1991, pp. 317-320), who notes that the adjusted within-groups sum of squares 
($SS_{WG(adj)}$) can be broken down into the following two components: a) The between-groups 
regression sum of squares ($SS_{bgreg}$), which is a source of variability that represents the 
degree to which the group regression coefficients deviate from the average regression 
coefficient for all of the data; and b) The within-groups regression sum of squares 
($SS_{wgreg}$), which is a source of variability that represents the degree to which the scores of 
individual subjects deviate from the regression line of the group of which a subject is a 
member. Since $SS_{WG(adj)}$ is the sum of $SS_{bgreg}$ and $SS_{wgreg}$, the value of $SS_{bgreg}$ can be 
expressed as follows: $SS_{bgreg} = SS_{WG(adj)} - SS_{wgreg}$. 
Although we have already computed the value of $SS_{wgreg}$, an alternative way to 
compute $SS_{wgreg}$ is with the equation noted below. 

$$SS_{wgreg} = \sum_{j=1}^{k}\left[SS_{Y_j} - \frac{(SP_{XY_j})^2}{SS_{X_j}}\right]$$

The notation in the above equation indicates that within each group, the result of 
dividing the square of the sum of products (i.e., $SP_{XY_j}$ is squared) by the sum of squares 
for the X variable/the covariate ($SS_{X_j}$) is subtracted from the sum of squares for the Y 
variable/the dependent variable ($SS_{Y_j}$). The resulting values for the k = 3 groups are then 
summed. The values for $SS_{X_j}$, $SS_{Y_j}$, and $SP_{XY_j}$ are listed in Table 21.11 for each group 
at the bottom of the summary information for that group. The computation of $SS_{wgreg}$ 
is demonstrated below. 

$$SS_{wgreg} = \left[2.8 - \frac{(2.8)^2}{2.8}\right] + \left[9.2 - \frac{(5.6)^2}{4.8}\right] + \left[10.8 - \frac{(6.2)^2}{4.8}\right] = 5.46$$

Employing the values $SS_{WG(adj)} = 5.61$ and $SS_{wgreg} = 5.46$, we can compute that the 
value of $SS_{bgreg} = .15$. 

$$SS_{bgreg} = SS_{WG(adj)} - SS_{wgreg} = 5.61 - 5.46 = .15$$


A mean square is computed for the between-groups regression element and the 
within-groups regression element as noted below. The degrees of freedom for the between-groups 
regression element are $df_{bgreg} = k - 1 = 3 - 1 = 2$, and the degrees of freedom for the 
within-groups regression element are $df_{wgreg} = k(n - 2) = 3(5 - 2) = 9$. 

$$MS_{bgreg} = \frac{SS_{bgreg}}{df_{bgreg}} = \frac{.15}{2} = .075 \qquad MS_{wgreg} = \frac{SS_{wgreg}}{df_{wgreg}} = \frac{5.46}{9} = .606$$

The F ratio for evaluating the homogeneity of regression assumption is computed 
with the equation below. 

$$F = \frac{MS_{bgreg}}{MS_{wgreg}} = \frac{.075}{.606} = .124$$

The degrees of freedom employed in evaluating the F ratio 
are $df_{num} = df_{bgreg} = 2$ and $df_{den} = df_{wgreg} = 9$. For $df_{num} = 2$ and $df_{den} = 9$, the tabled 
$F_{.05}$ and $F_{.01}$ values are $F_{.05} = 4.26$ and $F_{.01} = 8.02$. Since F = .124 is less than 
$F_{.05} = 4.26$, the null hypothesis cannot be rejected. The result is identical to that obtained 
with the other method. 
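
The steps of this evaluation can be retraced with a short Python sketch. It takes the two sums of squares quoted in the text as given and uses the tabled critical value rather than computing one; the variable names are illustrative and are not drawn from any statistical library.

```python
# Values quoted in the text for the analysis of covariance example.
k, n = 3, 5          # number of groups, subjects per group
ss_wg_adj = 5.61     # adjusted within-groups sum of squares
ss_wgreg = 5.46      # within-groups regression sum of squares

ss_bgreg = ss_wg_adj - ss_wgreg   # between-groups regression sum of squares
df_bgreg = k - 1
df_wgreg = k * (n - 2)

ms_bgreg = ss_bgreg / df_bgreg
ms_wgreg = ss_wgreg / df_wgreg
f_ratio = ms_bgreg / ms_wgreg

f_crit_05 = 4.26     # tabled F value for df = 2, 9 (taken from the text)
reject_null = f_ratio >= f_crit_05

print(round(ss_bgreg, 2), df_bgreg, df_wgreg, round(f_ratio, 3), reject_null)
```

A nonsignificant F ratio of this kind is the outcome that is consistent with the homogeneity of regression assumption.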


In addition to the assumptions already noted for the analysis of covariance, it also has the 
usual assumptions for the analysis of variance (i.e., normality of the underlying population 
distributions and homogeneity of variance). 


Test 22 

The Kruskal-Wallis One-Way Analysis of Variance by Ranks 
(Nonparametric Test Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test In a set of k independent samples (where k ≥ 2), do at least two 
of the samples represent populations with different median values? 


Relevant background information on test The Kruskal-Wallis one-way analysis of vari- 
ance by ranks (Kruskal (1952) and Kruskal and Wallis (1952)) is employed with ordinal (rank- 
order) data in a hypothesis testing situation involving a design with two or more independent 
samples. The test is an extension of the Mann-Whitney U test (Test 12) to a design involving 
more than two independent samples and, when k = 2, the Kruskal-Wallis one-way analysis of 
variance by ranks will yield a result that is equivalent to that obtained with the Mann-Whitney 
U test. If the result of the Kruskal-Wallis one-way analysis of variance by ranks is sig- 
nificant, it indicates there is a significant difference between at least two of the sample medians 
in the set of k medians. As a result of the latter, the researcher can conclude there is a high 
likelihood that at least two of the samples represent populations with different median values. 
In employing the Kruskal-Wallis one-way analysis of variance by ranks one of the 
following is true with regard to the rank-order data that are evaluated: a) The data are in a rank- 
order format, since it is the only format in which scores are available; or b) The data have been 
transformed into a rank-order format from an interval/ratio format, since the researcher has 
reason to believe that one or more of the assumptions of the single-factor between-subjects 
analysis of variance (Test 21) (which is the parametric analog of the Kruskal-Wallis test) are 
saliently violated. It should be noted that when a researcher elects to transform a set of 
interval/ratio data into ranks, information is sacrificed. This latter fact accounts for why there 
is reluctance among some researchers to employ nonparametric tests such as the Kruskal-Wallis 
one-way analysis of variance by ranks, even if there is reason to believe that one or more of 
the assumptions of the single-factor between-subjects analysis of variance have been violated. 
Various sources (e.g., Conover (1980, 1999), Daniel (1990), and Marascuilo and 
McSweeney (1977)) note that the Kruskal-Wallis one-way analysis of variance by ranks is 
based on the following assumptions: a) Each sample has been randomly selected from the 
population it represents; b) The k samples are independent of one another; c) The dependent vari- 
able (which is subsequently ranked) is a continuous random variable. In truth, this assumption, 
which is common to many nonparametric tests, is often not adhered to, in that such tests are often 
employed with a dependent variable that represents a discrete random variable; and d) The 
underlying distributions from which the samples are derived are identical in shape. The shapes 
of the underlying population distributions, however, do not have to be normal. Maxwell and 
Delaney (1990) point out that the assumption of identically shaped distributions implies equal 
dispersion of data within each distribution. Because of this, they note that, like the single-factor 
between-subjects analysis of variance, the Kruskal-Wallis one-way analysis of variance by 
ranks assumes homogeneity of variance with respect to the underlying population distributions. 
Because the latter assumption is not generally acknowledged for the Kruskal-Wallis one-way 
analysis of variance by ranks, it is not uncommon for sources to state that violation of the 
homogeneity of variance assumption justifies use of the Kruskal-Wallis one-way analysis of 
variance by ranks in lieu of the single-factor between-subjects analysis of variance. It should 
be pointed out, however, that there is some empirical research which suggests that the sampling 
distribution for the Kruskal-Wallis test statistic is not as affected by violation of the homog- 
eneity of variance assumption as is the F distribution (which is the sampling distribution for the 
single-factor between-subjects analysis of variance). One reason cited by various sources for 
employing the Kruskal-Wallis one-way analysis of variance by ranks, is that by virtue of 
ranking interval/ratio data a researcher can reduce or eliminate the impact of outliers. As noted 
in Section VII of the t test for two independent samples, since outliers can dramatically 
influence variability, they can be responsible for heterogeneity of variance between two or more 
samples. In addition, outliers can have a dramatic impact on the value of a sample mean. 

Zimmerman and Zumbo (1993) note that the result obtained with the Kruskal-Wallis one- 
way analysis of variance by ranks is equivalent (in terms of the derived probability value) to 
that which will be obtained if the rank-orders employed for the Kruskal-Wallis test are 
evaluated with a single-factor between-subjects analysis of variance. 


II. Example 


Example 22.1 is identical to Example 21.1 (which is evaluated with the single-factor between- 
subjects analysis of variance). In evaluating Example 22.1 it will be assumed that the ratio data 
(i.e., the number of nonsense syllables correctly recalled) are rank-ordered, since one or more of 
the assumptions of the single-factor between-subjects analysis of variance have been saliently 
violated. 


Example 22.1 A psychologist conducts a study to determine whether or not noise can inhibit 
learning. Each of 15 subjects is randomly assigned to one of three groups. Each subject is given 
20 minutes to memorize a list of 10 nonsense syllables which she is told she will be tested on the 
following day. The five subjects assigned to Group 1, the no noise condition, study the list of 
nonsense syllables while they are in a quiet room. The five subjects assigned to Group 2, the 
moderate noise condition, study the list of nonsense syllables while listening to classical music. 
The five subjects assigned to Group 3, the extreme noise condition, study the list of nonsense 
syllables while listening to rock music. The number of nonsense syllables correctly recalled by 
the 15 subjects follows: Group 1: 8, 10, 9, 10, 9; Group 2: 7, 8, 5, 8, 5; Group 3: 4, 8, 7, 5, 
7. Do the data indicate that noise influenced subjects’ performance? 


III. Null versus Alternative Hypotheses 


Null hypothesis $H_0$: $\theta_1 = \theta_2 = \theta_3$ 


(The median of the population Group 1 represents equals the median of the population Group 
2 represents equals the median of the population Group 3 represents. With respect to the sample 
data, when there are an equal number of subjects in each group, the sums of the ranks will be 
equal for all k groups, i.e., $\Sigma R_1 = \Sigma R_2 = \Sigma R_3$. A more general way of stating this (which 
also encompasses designs involving unequal sample sizes) is that the means of the ranks of the 
k groups will be equal (i.e., $\bar{R}_1 = \bar{R}_2 = \bar{R}_3$).) 


Alternative hypothesis $H_1$: Not $H_0$ 


(This indicates that there is a difference between at least two of the k = 3 population medians. 
It is important to note that the alternative hypothesis should not be written as follows: 
$H_1$: $\theta_1 \neq \theta_2 \neq \theta_3$. The latter notation for the alternative hypothesis is incorrect 
because it implies that all three population medians must differ from one another in order to 
reject the null hypothesis. With respect to the sample data, if there are an equal number of 
subjects in each group, when the alternative hypothesis is true the sums of the ranks of at least two 
of the k groups will not be equal. A more general way of stating this (which also encompasses 
designs involving unequal sample sizes) is that the means of the ranks of at least two of the k 
groups will not be equal. In this book it will be assumed (unless stated otherwise) that the 
alternative hypothesis for the Kruskal-Wallis one-way analysis of variance by ranks is stated 
nondirectionally.)¹ 


IV. Test Computations 


The data for Example 22.1 are summarized in Table 22.1. The total number of subjects 
employed in the experiment is N = 15. There are $n_j = n_1 = n_2 = n_3 = 5$ subjects in each group. 
The original interval/ratio scores of the subjects are recorded in the columns labelled $X_1$, $X_2$, and 
$X_3$. The adjacent columns $R_1$, $R_2$, and $R_3$ contain the rank-order assigned to each of the 
scores. The rankings for Example 22.1 are summarized in Table 22.2. 

The ranking protocol employed for the Kruskal-Wallis one-way analysis of variance by 
ranks is the same as that employed for the Mann-Whitney U test. In Table 22.2 the two-digit 
subject identification number indicates the order in which a subject’s score appears in Table 22.1 
followed by his/her group. Thus, Subject i, j is the $i^{th}$ subject in Group j. 


Table 22.1 Data for Example 22.1 


      Group 1            Group 2            Group 3 
    X_1     R_1        X_2     R_2        X_3     R_3 
     8      9.5         7       6          4       1 
    10     14.5         8      9.5         8      9.5 
     9     12.5         5       3          7       6 
    10     14.5         8      9.5         5       3 
     9     12.5         5       3          7       6 
    ΣR_1 = 63.5        ΣR_2 = 31          ΣR_3 = 25.5 

    Mean rank, Group 1: ΣR_1/n_1 = 63.5/5 = 12.7 
    Mean rank, Group 2: ΣR_2/n_2 = 31/5 = 6.2 
    Mean rank, Group 3: ΣR_3/n_3 = 25.5/5 = 5.1 


Table 22.2 Rankings for the Kruskal-Wallis Test for Example 22.1 

Subject identification number   1,3  3,2  5,2  4,3  1,2  3,3  5,3  1,1  2,2  4,2  2,3  3,1  5,1  2,1  4,1 
Number correct                    4    5    5    5    7    7    7    8    8    8    8    9    9   10   10 
Rank prior to tie adjustment      1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
Tie-adjusted rank                 1    3    3    3    6    6    6  9.5  9.5  9.5  9.5 12.5 12.5 14.5 14.5 


A brief summary of the ranking protocol employed in Table 22.2 follows: 

a) All N= 15 scores are arranged in order of magnitude (irrespective of group membership), 
beginning on the left with the lowest score and moving to the right as scores increase. This is 
done in the second row of Table 22.2. 

b) In the third row of Table 22.2, all N = 15 scores are assigned a rank. Moving from left 
to right, a rank of 1 is assigned to the score that is furthest to the left (which is the lowest score), 
a rank of 2 is assigned to the score that is second from the left (which, if there are no ties, will 
be the second lowest score), and so on until the score at the extreme right (which will be the 
highest score) is assigned a rank equal to N (if there are no ties for the highest score). 

c) In instances where two or more subjects have the same score, the average of the ranks 
involved is assigned to all scores tied for a given rank. The tie-adjusted ranks are listed in the 
fourth row of Table 22.2. For a comprehensive discussion of how to handle tied ranks, the reader 
should review the description of the ranking protocol in Section IV of the Mann-Whitney U 
test. 

It should be noted that, as is the case with the Mann-Whitney U test, it is permissible to 
reverse the ranking protocol described above. Specifically, one can assign a rank of 1 to the 
highest score, a rank of 2 to the second highest score, and so on, until reaching the lowest score 
which is assigned a rank equal to the value of N. This reverse ranking protocol will yield the 
same value for the Kruskal-Wallis test statistic as the protocol employed in Table 22.2. 

Upon rank-ordering the scores of the N = 15 subjects, the sum of the ranks is computed for 
each group. In Table 22.1 the sum of the ranks of the $j^{th}$ group is represented by the notation 
$\Sigma R_j$. Thus, $\Sigma R_1 = 63.5$, $\Sigma R_2 = 31$, $\Sigma R_3 = 25.5$. 
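
The ranking protocol described above can be sketched in Python as follows. This is a minimal illustration using only the standard library, with the scores of Example 22.1; the variable names are illustrative, and the averaging of tied positions is written out by hand rather than delegated to a statistical package.

```python
# Scores of Example 22.1, keyed by group number.
groups = {
    1: [8, 10, 9, 10, 9],
    2: [7, 8, 5, 8, 5],
    3: [4, 8, 7, 5, 7],
}

# a) Arrange all N scores in order of magnitude, irrespective of group.
pooled = sorted(score for scores in groups.values() for score in scores)

# b) Record the untied rank (1..N) of every occurrence of each score.
positions = {}
for rank, score in enumerate(pooled, start=1):
    positions.setdefault(score, []).append(rank)

# c) Tied scores all receive the average of the ranks involved.
tie_adjusted_rank = {score: sum(r) / len(r) for score, r in positions.items()}

# Sum and mean of the ranks within each group.
rank_sums = {
    g: sum(tie_adjusted_rank[score] for score in scores)
    for g, scores in groups.items()
}
mean_ranks = {g: rank_sums[g] / len(groups[g]) for g in groups}

print(rank_sums)
print(mean_ranks)
```

Reversing the protocol (assigning rank 1 to the highest score) amounts to sorting `pooled` in descending order; the rank sums change, but the test statistic computed from them does not.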

The chi-square distribution is used to approximate the Kruskal-Wallis test statistic. Equa- 
tion 22.1 is employed to compute the chi-square approximation of the Kruskal-Wallis test 
statistic (which is represented in most sources by the notation H). 


$$H = \frac{12}{N(N+1)} \sum_{j=1}^{k}\left[\frac{(\Sigma R_j)^2}{n_j}\right] - 3(N+1) \qquad \text{(Equation 22.1)}$$


Note that in Equation 22.1, the term $\sum_{j=1}^{k}[(\Sigma R_j)^2/n_j]$ indicates that for each of the k 
groups, the sum of the ranks is squared and then divided by the number of subjects in the group. 
Upon doing this for all k groups, the resulting values are summed. Substituting the appropriate 
values from Example 22.1 in Equation 22.1, the value H = 8.44 is computed. 


$$H = \frac{12}{(15)(15+1)}\left[\frac{(63.5)^2}{5} + \frac{(31)^2}{5} + \frac{(25.5)^2}{5}\right] - (3)(15+1) = 8.44$$
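
The substitution can be reproduced with a short Python sketch. The function name `kruskal_wallis_h` is illustrative and does not refer to any library routine; it simply evaluates the expression for H from the rank sums and group sizes, without any correction for ties.

```python
def kruskal_wallis_h(rank_sums, group_sizes):
    """Chi-square approximation H computed from the rank sums (no tie correction)."""
    n_total = sum(group_sizes)
    weighted = sum(r ** 2 / n for r, n in zip(rank_sums, group_sizes))
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Rank sums and group sizes of Example 22.1.
h = kruskal_wallis_h([63.5, 31.0, 25.5], [5, 5, 5])
print(round(h, 2))
```

Because the function accepts a separate size for each group, the same sketch applies to designs with unequal sample sizes.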


V. Interpretation of the Test Results 


In order to reject the null hypothesis, the computed value $H = \chi^2$ must be equal to or greater 
than the tabled critical chi-square value at the prespecified level of significance. The computed 
chi-square value is evaluated with Table A4 (Table of the Chi-Square Distribution) in the 
Appendix. For the appropriate degrees of freedom, the tabled $\chi^2_{.95}$ value (which is the chi-square 
value at the 95th percentile) and the tabled $\chi^2_{.99}$ value (which is the chi-square value at the 99th 
percentile) are employed as the .05 and .01 critical values for evaluating a nondirectional 
alternative hypothesis. The number of degrees of freedom employed in the analysis is computed 
with Equation 22.2. Thus, df = 3 - 1 = 2. 

$$df = k - 1 \qquad \text{(Equation 22.2)}$$


For df = 2, the tabled critical .05 and .01 chi-square values are $\chi^2_{.05} = 5.99$ and 
$\chi^2_{.01} = 9.21$. Since the computed value H = 8.44 is greater than $\chi^2_{.05} = 5.99$, the 
alternative hypothesis is supported at the .05 level. Since, however, H = 8.44 is less than $\chi^2_{.01} = 9.21$, 
the alternative hypothesis is not supported at the .01 level.² A summary of the analysis of 
Example 22.1 with the Kruskal-Wallis one-way analysis of variance by ranks follows: It can 
be concluded that there is a significant difference between at least two of the three groups 
exposed to different levels of noise. This result can be summarized as follows: H(2) = 8.44, 
p < .05. 
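
For the special case of df = 2, the tabled critical values can be checked without a table, since the upper-tail probability of a chi-square variable with two degrees of freedom is exp(-x/2). The Python sketch below relies on that identity, so it applies only when k = 3; for other degrees of freedom a table or a statistical library is needed. The function names are illustrative.

```python
import math


def chi2_upper_tail_df2(x):
    """P(chi-square >= x) for df = 2 only."""
    return math.exp(-x / 2)


def chi2_critical_df2(alpha):
    """Critical value with upper-tail area alpha, for df = 2 only."""
    return -2 * math.log(alpha)


h = 8.44
crit_05 = chi2_critical_df2(0.05)
crit_01 = chi2_critical_df2(0.01)
p_value = chi2_upper_tail_df2(h)

print(round(crit_05, 2), round(crit_01, 2), round(p_value, 4))
print(h >= crit_05, h >= crit_01)
```

The approximate probability obtained in this way falls between .01 and .05, which agrees with the conclusion reached from the tabled values.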

It should be noted that when the data for Example 22.1 are evaluated with a single-factor 
between-subjects analysis of variance, the null hypothesis can be rejected at both the .05 and 
.01 levels (although it barely achieves significance at the latter level and, in the case of the 
Kruskal-Wallis test, the result just falls short of significance at the .01 level). The slight 
discrepancy between the results of the two tests reflects the fact that, as a general rule (assuming 
that none of the assumptions of the analysis of variance are saliently violated), the Kruskal- 
Wallis one-way analysis of variance by ranks provides a less powerful test of an alternative 
hypothesis than the single-factor between-subjects analysis of variance. 


VI. Additional Analytical Procedures for the Kruskal-Wallis 
One-Way Analysis of Variance by Ranks and/or Related Tests 


1. Tie correction for the Kruskal-Wallis one-way analysis of variance by ranks Some 
sources recommend that if there is an excessive number of ties in the overall distribution of N 
scores, the value of the Kruskal-Wallis test statistic be adjusted. The tie correction results in 
a small increase in the value of H (thus providing a slightly more powerful test of the alternative 
hypothesis). Equation 22.3 is employed to compute the value C, which represents the tie cor- 
rection factor for the Kruskal-Wallis one-way analysis of variance by ranks. 


$$C = 1 - \frac{\sum_{i=1}^{s}(t_i^3 - t_i)}{N^3 - N} \qquad \text{(Equation 22.3)}$$


Where: s = The number of sets of ties 
$t_i$ = The number of tied scores in the $i^{th}$ set of ties 


The notation $\sum_{i=1}^{s}(t_i^3 - t_i)$ indicates the following: a) For each set of ties, the number of 
ties in the set is subtracted from the cube of the number of ties in that set; and b) The sum of all 
the values computed in part a) is obtained. The correction for ties will now be computed for 
Example 22.1. In the latter example there are s = 5 sets of ties (i.e., three scores of 5, three 
scores of 7, four scores of 8, two scores of 9, and two scores of 10). Thus: 


$$\sum_{i=1}^{s}(t_i^3 - t_i) = [(3)^3 - 3] + [(3)^3 - 3] + [(4)^3 - 4] + [(2)^3 - 2] + [(2)^3 - 2] = 120$$

Employing Equation 22.3, the value C = .964 is computed. 

$$C = 1 - \frac{120}{(15)^3 - 15} = .964$$

$H_C$, which represents the tie-corrected value of the Kruskal-Wallis test statistic, is 
computed with Equation 22.4. 

$$H_C = \frac{H}{C} = \frac{8.44}{.964} = 8.76 \qquad \text{(Equation 22.4)}$$


As is the case with H = 8.44 computed with Equation 22.1, the value $H_C = 8.76$ 
computed with Equation 22.4 is significant at the .05 level (since it is greater than $\chi^2_{.05} = 5.99$), 
but is not significant at the .01 level (since it is less than $\chi^2_{.01} = 9.21$). Although Equation 22.4 
results in a slightly less conservative test than Equation 22.1, in this instance the two equations 
lead to identical conclusions with respect to the null hypothesis. 
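
The tie correction lends itself to a brief Python sketch. The sets of ties are counted directly from the pooled scores of Example 22.1 with the standard library; as in the text, the correction factor is rounded to three decimal places before the division, which is an assumption made here so that the result matches the value reported above.

```python
from collections import Counter

# Pooled scores of Example 22.1 (all three groups).
scores = [8, 10, 9, 10, 9, 7, 8, 5, 8, 5, 4, 8, 7, 5, 7]
n_total = len(scores)

# Size of each set of ties (scores that occur more than once).
tie_sizes = [count for count in Counter(scores).values() if count > 1]

tie_sum = sum(t ** 3 - t for t in tie_sizes)
c = 1 - tie_sum / (n_total ** 3 - n_total)

h = 8.44                   # uncorrected statistic, as reported in the text
h_c = h / round(c, 3)      # the text rounds C to .964 before dividing

print(sorted(tie_sizes), tie_sum, round(c, 3), round(h_c, 2))
```

Since C can never exceed 1, dividing by it can only leave H unchanged or increase it, which is why the corrected test is the slightly less conservative of the two.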


2. Pairwise comparisons following computation of the test statistic for the Kruskal-Wallis 
one-way analysis of variance by ranks Prior to reading this section the reader should review 
the discussion of comparisons in Section VI of the single-factor between-subjects analysis of 
variance. As is the case with the omnibus F value computed for the single-factor between- 
subjects analysis of variance, the H value computed with Equation 22.1 is based on an evalu- 
ation of all k groups. When the value of H is significant, it does not indicate whether just two 
or, in fact, more than two groups differ significantly from one another. In order to answer the 
latter question, it is necessary to conduct comparisons contrasting specific groups with one 
another. This section will describe methodologies that can be employed for conducting simple/ 
pairwise comparisons following the computation of an H value.³ 

In conducting a simple comparison, the null hypothesis and nondirectional alternative 
hypothesis are as follows: $H_0$: $\theta_a = \theta_b$ versus $H_1$: $\theta_a \neq \theta_b$. In the aforementioned hypotheses, 
$\theta_a$ and $\theta_b$ represent the medians of the populations represented by the two groups involved 
in the comparison. The alternative hypothesis can also be stated directionally as follows: 
$H_1$: $\theta_a > \theta_b$ or $H_1$: $\theta_a < \theta_b$. 

Various sources (e.g., Daniel (1990) and Siegel and Castellan (1988)) describe a comparison 
procedure for the Kruskal-Wallis one-way analysis of variance by ranks (described by Dunn 
(1964)), which is essentially the application of the Bonferroni-Dunn method described in Section 
VI of the single-factor between-subjects analysis of variance to the Kruskal-Wallis test model. 
Through use of Equation 22.5, the procedure allows a researcher to identify the minimum required 
difference between the means of the ranks of any two groups (designated as $CD_{KW}$) in order for 
them to differ from one another at the prespecified level of significance.⁴ 





$$CD_{KW} = z_{adj}\sqrt{\frac{N(N+1)}{12}\left[\frac{1}{n_a} + \frac{1}{n_b}\right]} \qquad \text{(Equation 22.5)}$$


Where: $n_a$ and $n_b$ represent the number of subjects in each of the groups involved in the 
simple comparison 


The value of $z_{adj}$ is obtained from Table A1 (Table of the Normal Distribution) in the 
Appendix. In the case of a nondirectional alternative hypothesis, $z_{adj}$ is the z value above which 
a proportion of cases corresponding to the value $\alpha_{FW}/2c$ falls (where c is the total number of 
comparisons that are conducted). In the case of a directional alternative hypothesis, $z_{adj}$ is the 
z value above which a proportion of cases corresponding to the value $\alpha_{FW}/c$ falls. When all 
possible pairwise comparisons are made c = [k(k - 1)]/2, and thus, 2c = k(k - 1). In Example 
22.1 the number of pairwise/simple comparisons that can be conducted is c = [3(3 - 1)]/2 = 3: 
specifically, Group 1 versus Group 2, Group 1 versus Group 3, and Group 2 versus Group 3. 
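
The way the critical difference depends on the number of comparisons can be sketched in Python. The example assumes the standard form of the Dunn procedure (the z value multiplied by the square root of N(N + 1)/12 times the sum of the reciprocals of the two group sizes), a nondirectional alternative hypothesis, a familywise error rate of .05 spread over all three pairwise comparisons, and the mean ranks of Example 22.1. `NormalDist` from the standard library takes the place of Table A1; the remaining names are illustrative.

```python
import math
from itertools import combinations
from statistics import NormalDist

k, n_total = 3, 15
group_size = {1: 5, 2: 5, 3: 5}
mean_rank = {1: 63.5 / 5, 2: 31 / 5, 3: 25.5 / 5}
alpha_fw = 0.05

c = k * (k - 1) // 2                                   # number of pairwise comparisons
z_adj = NormalDist().inv_cdf(1 - alpha_fw / (2 * c))   # upper-tail area alpha_FW / 2c

significant = []
for a, b in combinations(sorted(group_size), 2):
    cd_kw = z_adj * math.sqrt(
        n_total * (n_total + 1) / 12 * (1 / group_size[a] + 1 / group_size[b])
    )
    difference = abs(mean_rank[a] - mean_rank[b])
    if difference >= cd_kw:
        significant.append((a, b))
    print(f"Group {a} vs Group {b}: difference = {difference:.1f}, CD = {cd_kw:.2f}")

print(significant)
```

With equal group sizes the critical difference is the same for every pair; with unequal sizes it must be recomputed for each comparison, which is why the sketch evaluates it inside the loop.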

The value of $z_{adj}$ will be a function of both the maximum familywise Type I error rate 
($\alpha_{FW}$) the researcher is willing to tolerate and the total number of comparisons that are 
conducted. When a limited number of comparisons are planned prior to collecting the data, most 
sources take the position that a researcher is not obliged to control the value of $\alpha_{FW}$. In such a 
case, the per comparison Type I error rate ($\alpha_{PC}$) will be equal to the prespecified value of 
alpha. When $\alpha_{FW}$ is not adjusted, the value of $z_{adj}$ employed in Equation 22.5 will be the tabled 
critical z value that corresponds to the prespecified level of significance. Thus, if a 
nondirectional alternative hypothesis is employed and $\alpha = \alpha_{PC} = .05$, the tabled critical 
two-tailed .05 value $z_{.05} = 1.96$ is used to represent $z_{adj}$ in Equation 22.5. If $\alpha = \alpha_{PC} = .01$, the 
tabled critical two-tailed .01 value $z_{.01} = 2.58$ is used in Equation 22.5. In the same respect, if 
a directional alternative hypothesis is employed, the tabled critical .05 and .01 one-tailed values $z_{.05} = 1.65$ 
and $z_{.01} = 2.33$ are used for $z_{adj}$ in Equation 22.5. 

When comparisons are not planned beforehand, it is generally acknowledged that the value 
of α_FW must be controlled so as not to become excessive. The general approach for controlling 
the latter value is to establish a per comparison Type I error rate which insures that α_FW will 
not exceed some maximum value stipulated by the researcher. One method for doing this 
(described under the single-factor between-subjects analysis of variance as the 
Bonferroni-Dunn method) establishes the per comparison Type I error rate by dividing the maximum 
value one will tolerate for the familywise Type I error rate by the total number of comparisons 
conducted. Thus, in Example 22.1, if one intends to conduct all three pairwise comparisons and 
wants to insure that α_FW does not exceed .05, α_PC = α_FW/c = .05/3 = .0167. The latter 
proportion is used to determine the value of z_adj. As noted earlier, if a directional alternative 
hypothesis is employed for a comparison, the value of z_adj employed in Equation 22.5 is the z 
value above which a proportion equal to α_PC = α_FW/c of the cases falls. In Table A1, the z 
value that corresponds to the proportion .0167 is z = 2.13. By employing z_adj = 2.13 in 
Equation 22.5, one can be assured that within the "family" of three pairwise comparisons, 
α_FW will not exceed .05 (assuming all of the comparisons are directional). If a nondirectional 
alternative hypothesis is employed for all of the comparisons, the value of z_adj will be the z value 
above which a proportion equal to α_FW/2c = α_PC/2 of the cases falls. Since 
α_PC/2 = .0167/2 = .0083, z = 2.39. By employing z_adj = 2.39 in Equation 22.5, one can be 
assured that α_FW will not exceed .05. 
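The Bonferroni-Dunn adjustment described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `z_adj` and its arguments are hypothetical, and the standard library's `NormalDist().inv_cdf` stands in for a lookup in Table A1.

```python
from statistics import NormalDist

def z_adj(alpha_fw: float, c: int, directional: bool = False) -> float:
    """Critical z for c comparisons when the familywise rate is capped at alpha_fw."""
    alpha_pc = alpha_fw / c                           # per comparison Type I error rate
    tail = alpha_pc if directional else alpha_pc / 2  # proportion of cases above z
    return NormalDist().inv_cdf(1 - tail)

k = 3
c = k * (k - 1) // 2                                  # all possible pairwise comparisons
z_directional = z_adj(0.05, c, directional=True)
z_nondirectional = z_adj(0.05, c)
print(c, round(z_directional, 2), round(z_nondirectional, 2))
```

With k = 3 and a familywise rate of .05, the rounded results agree with the tabled values z = 2.13 (directional) and z = 2.39 (nondirectional) cited in the text.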


Table 22.3 Difference Scores Between Pairs 
of Mean Ranks for Example 22.1 


|R̄_1 − R̄_2| = |12.7 − 6.2| = 6.5 
|R̄_1 − R̄_3| = |12.7 − 5.1| = 7.6 
|R̄_2 − R̄_3| = |6.2 − 5.1| = 1.1 


In order to employ the CD_KW value computed with Equation 22.5, it is necessary to 
determine the mean rank for each of the k groups, and then compute the absolute value of the 
difference between the mean ranks of each pair of groups that are compared. In Table 22.1 the 
following values for the mean ranks of the groups are computed: R̄_1 = 12.7, R̄_2 = 6.2, 
R̄_3 = 5.1. Employing the latter values, Table 22.3 summarizes the difference scores between 
pairs of mean ranks. 

If any of the differences between mean ranks is equal to or greater than the CD_KW value 
computed with Equation 22.5, a comparison is declared significant. Equation 22.5 will now be 
employed to evaluate the nondirectional alternative hypothesis H_1: θ_a ≠ θ_b for all three 
pairwise comparisons. Since it will be assumed that the comparisons are unplanned and that the 
researcher does not want the value of α_FW to exceed .05, the value z_adj = 2.39 will be used in 
computing CD_KW. 


$$CD_{KW} = (2.39) \sqrt{\frac{(15)(15+1)}{12}\left(\frac{1}{5}+\frac{1}{5}\right)} = (2.39)(2.83) = 6.76$$


The obtained value CD_KW = 6.76 indicates that any difference between the mean ranks of 
two groups that is equal to or greater than 6.76 is significant. With respect to the three pairwise 
comparisons, the only difference between mean ranks which is greater than CD_KW = 6.76 is 
|R̄_1 − R̄_3| = 7.6. Thus, we can conclude there is a significant difference between the 
performance of Group 1 and Group 3. Note that although |R̄_1 − R̄_2| = 6.5 is close to 
CD_KW = 6.76, it is not statistically significant unless the researcher is willing to tolerate a 
familywise error rate slightly above .05. 
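As a check on the arithmetic, the critical difference and the three pairwise decisions for Example 22.1 can be reproduced with a short Python sketch. The helper name `cd_kw` and the dictionary layout are illustrative choices, not part of the procedure itself; the mean ranks, sample sizes, and z_adj = 2.39 are taken from the text.

```python
import math
from itertools import combinations

def cd_kw(z_adj: float, n_total: int, n_a: int, n_b: int) -> float:
    """Minimum required difference between two mean ranks (form of Equation 22.5)."""
    return z_adj * math.sqrt(n_total * (n_total + 1) / 12 * (1 / n_a + 1 / n_b))

mean_ranks = {1: 12.7, 2: 6.2, 3: 5.1}   # mean ranks for Example 22.1
sizes = {1: 5, 2: 5, 3: 5}
N = sum(sizes.values())

results = {}
for a, b in combinations(mean_ranks, 2):
    diff = abs(mean_ranks[a] - mean_ranks[b])
    cd = cd_kw(2.39, N, sizes[a], sizes[b])
    results[(a, b)] = (round(diff, 1), round(cd, 2), diff >= cd)

for pair, (diff, cd, significant) in results.items():
    print(pair, diff, cd, significant)
```

Only the Group 1 versus Group 3 difference (7.6) reaches the critical difference of 6.76, in agreement with the conclusion above.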

An alternative strategy that can be employed for conducting pairwise comparisons for the 
Kruskal-Wallis test model is to use the Mann-Whitney U test for each comparison. Use of 
the latter test requires that the data for each pair of groups to be compared be rank-ordered, and 
that a separate U value be computed for that comparison. The exact distribution of the 
Mann-Whitney test statistic can only be used when the value of α_PC is equal to one of the 
probabilities documented in Table A11 (Table of Critical Values for the Mann-Whitney U Statistic) 
in the Appendix. When α_PC is a value other than those listed in Table A11, the normal 
approximation of the Mann-Whitney U test statistic must be employed. 

When the Mann-Whitney U test is employed for the three pairwise comparisons, the 
following U values are computed: a) Group 1 versus Group 2: U = 1; b) Group 1 versus Group 
3: U = .5; and c) Group 2 versus Group 3: U = 10. When the aforementioned U values are 
substituted in Equations 12.4 and 12.5 (the uncorrected and continuity-corrected normal 
approximations for the Mann-Whitney U test), the following absolute z values are computed: a) Group 
1 versus Group 2: z = 2.40 and z = 2.30; b) Group 1 versus Group 3: z = 2.51 and z = 2.40; 
and c) Group 2 versus Group 3: z = .52 and z = .42. If we want to evaluate a nondirectional 
alternative hypothesis and insure that α_FW does not exceed .05, the value of α_PC is set equal to 
.0167. Table A11 cannot be employed, since it does not list two-tailed critical U values for 
α = .0167. In order to evaluate the result of the normal approximation, we identify the tabled critical 
two-tailed .0167 z value in Table A1. In employing Equation 22.5 earlier in this section, we 
determined that the latter value is z_adj = 2.39. Since the uncorrected values z = 2.40 (for 
the comparison Group 1 versus Group 2) and z = 2.51 (for the comparison Group 1 versus 
Group 3) computed with Equation 12.4 are greater than z_adj = 2.39, the latter two comparisons 
are significant if we wish to insure that α_FW does not exceed .05. If the correction for continuity 
is employed, only the value z = 2.40 (for the comparison Group 1 versus Group 3), computed 
with Equation 12.5, is significant, since it exceeds z_adj = 2.39. The Group 1 versus Group 2 
comparison falls short of significance, since z = 2.30 is less than z_adj = 2.39. Recollect that 
when Equation 22.5 is employed to conduct the same set of comparisons, the Group 1 versus 
Group 3 comparison is significant, whereas the Group 1 versus Group 2 comparison falls just 
short of significance. Thus, the result obtained with Equation 22.5 is identical to that obtained 
when the continuity-corrected normal approximation of the Mann-Whitney U test is employed. 
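The six z values reported above can be reproduced with the following Python sketch. Equations 12.4 and 12.5 are not shown in this section, so the code assumes they take the usual form of the normal approximation, with mean n_1 n_2 / 2 and standard deviation sqrt[n_1 n_2 (n_1 + n_2 + 1)/12], and with .5 subtracted from the absolute numerator for the continuity correction. The function name `mann_whitney_z` is hypothetical.

```python
import math

def mann_whitney_z(u: float, n1: int, n2: int, continuity: bool = False) -> float:
    """Absolute z for the normal approximation of U (assumed form of Eq. 12.4 / 12.5)."""
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    numerator = abs(u - mean_u)
    if continuity:
        numerator -= 0.5          # correction for continuity
    return numerator / sd_u

u_values = {"1 vs 2": 1, "1 vs 3": 0.5, "2 vs 3": 10}
z_table = {
    pair: (round(mann_whitney_z(u, 5, 5), 2),
           round(mann_whitney_z(u, 5, 5, continuity=True), 2))
    for pair, u in u_values.items()
}
for pair, (z_uncorrected, z_corrected) in z_table.items():
    print(pair, z_uncorrected, z_corrected, z_uncorrected > 2.39, z_corrected > 2.39)
```

Before rounding, the corrected value for Group 1 versus Group 3 and the uncorrected value for Group 1 versus Group 2 are both about 2.402, which is why each just exceeds the adjusted critical value 2.39.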

In the event the researcher elects not to control the value of α_FW and employs α_PC = .05 
in evaluating the three pairwise comparisons (once again assuming a nondirectional analysis), 
both the Group 1 versus Group 2 and Group 1 versus Group 3 comparisons are significant at the 
.05 level, regardless of which comparison procedure is employed. Specifically, both the 
uncorrected and corrected normal approximations are significant, since z = 2.40 and z = 2.30 
(computed for the comparison Group 1 versus Group 2) and z = 2.51 and z = 2.40 (computed 
for the comparison Group 1 versus Group 3) are greater than the tabled critical two-tailed value 
z_.05 = 1.96. Employing Table A11, we also determine that both the Group 1 versus Group 2 
and Group 1 versus Group 3 comparisons are significant at the .05 level, since the computed 
values U = 1 and U = .5 are less than the tabled critical two-tailed .05 value U_.05 = 2 (based on 
n_1 = 5 and n_2 = 5). If Equation 22.5 is employed for the same set of comparisons, CD_KW 
= (1.96)(2.83) = 5.55. Thus, if the latter equation is employed, the Group 1 versus Group 2 and 
Group 1 versus Group 3 comparisons are significant, since in both instances the difference 
between the mean ranks is greater than CD_KW = 5.55. 

The above discussion of comparisons illustrates that, generally speaking, the results 
obtained with Equation 22.5 and the Mann-Whitney U test (as well as other comparison 
procedures that have been developed for the Kruskal-Wallis one-way analysis of variance) 
will be reasonably consistent with one another. As noted in Endnote 7 (as well as in the 
discussion of comparisons in Section VI of the single-factor between-subjects analysis of 
variance), in instances where two or more comparison procedures yield inconsistent results, the 
most effective way to clarify the status of the null hypothesis is to replicate a study one or more 
times. In the final analysis, the decision regarding which of the available comparison procedures 
to employ is usually not the most important issue facing the researcher conducting comparisons. 
The main issue is what maximum value one is willing to tolerate for α_FW. Additional sources 
on comparison procedures for the Kruskal-Wallis test model are Marascuilo and McSweeney 
(1977) (who describe a methodology for conducting complex comparisons), Wike (1978) (who 
provides a comparative analysis of a number of different procedures), and Hollander and Wolfe 
(1999). 

The same logic employed for computing a confidence interval for a comparison described 
in Section VI of the single-factor between-subjects analysis of variance can be employed to 
compute a confidence interval for the Kruskal-Wallis test model. Specifically: Add the computed 
value of CD_KW to, and subtract it from, the obtained difference between the two mean 
ranks involved in a comparison. Thus, CI_.95 (based on α_FW = .05) for the comparison Group 
1 versus Group 3 is computed as follows: CI_.95 = 7.6 ± 6.76. In other words, the researcher 
can be 95% confident (or the probability is .95) that the mean of the ranks in the population 
represented by Group 1 is between .84 and 14.36 units larger than the mean of the ranks in the 
population represented by Group 3. Marascuilo and McSweeney (1977) provide a more detailed 
discussion of the computation of a confidence interval for the Kruskal-Wallis test model. 


VII. Additional Discussion of the Kruskal-Wallis One-Way Analysis 
of Variance by Ranks 


1. Exact tables of the Kruskal-Wallis distribution Although an exact probability value can 
be computed for obtaining a configuration of ranks that is equivalent to or more extreme than the 
configuration observed in the data evaluated with the Kruskal-Wallis one-way analysis of 
variance by ranks, the chi-square distribution is generally employed to estimate the latter prob- 
ability. As the values of k and N increase, the chi-square distribution provides a more accurate 
estimate of the exact Kruskal-Wallis distribution. Although most sources employ the chi-square 
approximation regardless of the values of k and N, some sources recommend that exact tables 
be employed under certain conditions. Beyer (1968), Daniel (1990), and Siegel and Castellan 
(1988) provide exact Kruskal-Wallis probabilities for the case in which k = 3 and the number of 
subjects in any of the samples is five or less. Use of the chi-square distribution for small sample 
sizes will generally result in a slight decrease in the power of the test (i.e., there is a higher 
likelihood of retaining a false null hypothesis). Thus, for small sample sizes, the tabled critical 
chi-square value should, in actuality, be a little lower than the value listed in Table A4. 

In point of fact, the exact tabled critical H values for k = 3 and n_j = 5 are H_.05 = 5.78 
and H_.01 = 7.98. If the latter critical values are employed, the value H = 8.44 computed for 
Example 22.1 is significant at both the .05 and .01 levels, since H = 8.44 is greater than both 
H_.05 = 5.78 and H_.01 = 7.98. Thus, in this instance the exact tables and the chi-square 
approximation do not yield identical results. 
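The disagreement between the two sets of critical values can be illustrated with a brief Python sketch. It relies on the fact that for df = 2 (and only for df = 2) the upper-tail probability of the chi-square distribution has the closed form exp(−x/2), so the critical value at level α is −2 ln α. The exact critical values 5.78 and 7.98 are those quoted in the text; the variable names are illustrative.

```python
import math

H = 8.44                                   # value computed for Example 22.1
p_chi_square = math.exp(-H / 2)            # upper-tail probability, chi-square with df = 2

chi_square_critical = {a: -2 * math.log(a) for a in (0.05, 0.01)}
exact_critical = {0.05: 5.78, 0.01: 7.98}  # exact tabled values quoted in the text

decisions = {
    a: {"chi-square": H > chi_square_critical[a], "exact": H > exact_critical[a]}
    for a in (0.05, 0.01)
}
print(round(p_chi_square, 4), decisions)
```

Under the chi-square approximation the result is significant at the .05 level but not at the .01 level, whereas the exact critical values yield significance at both levels.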


2. Equivalency of the Kruskal-Wallis one-way analysis of variance by ranks and the 
Mann-Whitney U test when k = 2 In Section I it is noted that when k = 2 the Kruskal- 
Wallis one-way analysis of variance by ranks will yield a result that is equivalent to that 
obtained with the Mann-Whitney U test. To be more specific, the Kruskal-Wallis test will 
yield a result that is equivalent to the normal approximation for the Mann-Whitney U test when 
the correction for continuity is not employed (i.e., the result obtained with Equation 12.4). In 
order to demonstrate the equivalency of the two tests, Equation 22.1 is employed below to 
analyze the data for Example 12.1, which was previously evaluated with the Mann-Whitney 
U test. 

Employing Equation 22.2, df = 2 − 1 = 1. For df = 1, the tabled critical .05 and .01 
chi-square values are χ²_.05 = 3.84 and χ²_.01 = 6.63. Since the obtained value H = 3.15 is less 
than χ²_.05 = 3.84, the null hypothesis cannot be rejected. 

Equation 12.4 yields the value z = −1.78 for the same set of data. When the latter value 
is squared, it yields a value that is equal to the H (chi-square) value computed with Equation 22.1 
(i.e., (z = −1.78)² = (χ² = 3.15)). It is also the case that the square of the tabled critical z 
value employed for the normal approximation of the Mann-Whitney U test will always be equal 
to the tabled critical chi-square value employed for the Kruskal-Wallis test at the same level 
of significance. Thus, the square of the tabled critical two-tailed value z_.05 = 1.96 employed 
for the normal approximation of the Mann-Whitney U test equals χ²_.05 = 3.84 employed for 
the Kruskal-Wallis test (i.e., (z = 1.96)² = (χ² = 3.84)). 
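The relationship between the two sets of critical values can be verified numerically. The sketch below squares the two-tailed critical z values (obtained from the standard library rather than from Table A1) and also squares the unrounded value z = 1.7756 reported in the endnotes; the variable names are illustrative.

```python
from statistics import NormalDist

# Square of the two-tailed critical z equals the chi-square critical value for df = 1.
z_squared = {}
for alpha in (0.05, 0.01):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    z_squared[alpha] = z ** 2
    print(alpha, round(z, 2), round(z_squared[alpha], 2))

h_from_z = 1.7756 ** 2     # unrounded absolute z for Example 12.1
print(round(h_from_z, 2))
```

The squared critical values round to 3.84 and 6.63, and squaring the unrounded z recovers H = 3.15.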


3. Power-efficiency of the Kruskal-Wallis one-way analysis of variance by ranks When 
the underlying population distributions are normal, the asymptotic relative efficiency (which 
is discussed in Section VII of the Wilcoxon signed-ranks test (Test 6)) of the Kruskal- 
Wallis one-way analysis of variance by ranks is .955 (when contrasted with the single-factor 
between-subjects analysis of variance). For population distributions that are not normal, the 
asymptotic relative efficiency of the Kruskal-Wallis test is generally equal to or greater than 
1. As a general rule, proponents of nonparametric tests take the position that when a researcher 
has reason to believe that the normality assumption of the single-factor between-subjects 
analysis of variance has been saliently violated, the Kruskal-Wallis one-way analysis of 
variance by ranks provides a powerful test of the comparable alternative hypothesis. 


4. Alternative nonparametric rank-order procedures for evaluating a design involving k 
independent samples In addition to the Kruskal-Wallis one-way analysis of variance by 
ranks, a number of other nonparametric procedures for two or more independent samples have 
been developed that can be employed with ordinal data. Among the more commonly cited 
alternative procedures are the following: a) The van der Waerden normal-scores test for k 
independent samples (Test 23) (Van der Waerden (1952/1953)), which is described in the next 
chapter, as well as alternative normal-scores tests developed by Terry and Hoeffding (Terry 
(1952)) and Bell and Doksum (1965); b) The Jonckheere-Terpstra test for ordered alterna- 
tives (Jonckheere (1954); Terpstra (1952)) can be employed when the alternative hypothesis for 
a k independent samples design specifies the rank-order of the k population medians. The latter 
test is described in Daniel (1990) and Hollander and Wolfe (1999); c) The median test for 
independent samples (Test 16e) (discussed in Section VI of the chi-square test for r x c tables 
(Test 16)) can be extended to three or more independent samples by dichotomizing k samples 
with respect to their median values; and d) The Kolmogorov-Smirnov test for two 
independent samples (Test 13) (Kolmogorov (1933) and Smirnov (1939)) can be extended to 
three or more independent samples. The use of the latter test with more than two samples is 
discussed in Bradley (1968) and Conover (1980, 1999). Sheskin (1984) describes some of the 
aforementioned tests in greater detail, as well as citing additional procedures that can be 
employed for k independent samples designs involving rank-order data. 


VIII. Additional Examples Illustrating the Use of the Kruskal- 
Wallis One-Way Analysis of Variance by Ranks 


The Kruskal-Wallis one-way analysis of variance by ranks can be employed to evaluate any 
of the additional examples noted for the single-factor between-subjects analysis of variance, 
if the data for the latter examples are rank-ordered. In addition, the Kruskal-Wallis test can be 
used to evaluate the data for any of the additional examples noted for the t test for two 
independent samples (Test 11) and the Mann-Whitney U test. Examples 22.2 and 22.3 are two 
additional examples that can be evaluated with the Kruskal-Wallis one-way analysis of 
variance by ranks. Example 22.2 is an extension of Example 12.2 (evaluated with the 
Mann-Whitney U test) to a design involving k = 3 groups. In Example 22.2 (as well as Example 22.3) 
the original data are presented as ranks, rather than in an interval/ratio format. Since the 
rank-orderings for Example 22.2 are identical to those employed in Example 22.1, it yields the same 
result. Example 22.3 (which is also an extension of Example 12.2) illustrates the use of the 
Kruskal-Wallis one-way analysis of variance by ranks with a design involving k = 4 groups 
and unequal sample sizes. 


Example 22.2 Doctor Radical, a math instructor at Logarithm University, has three classes in 
advanced calculus. There are five students in each class. The instructor uses a programmed 
textbook in Class 1, a conventional textbook in Class 2, and his own printed notes in Class 3. At 
the end of the semester, in order to determine if the type of instruction employed influences 
student performance, Dr. Radical has another math instructor, Dr. Root, rank the 15 students 
in the three classes with respect to math ability. The rankings of the students in the three classes 
follow: Class 1: 9.5, 14.5, 12.5, 14.5, 12.5; Class 2: 6, 9.5, 3, 9.5, 3; and Class 3: 1, 9.5, 6, 3, 
6 (assume the lower the rank, the better the student). 


Note that whereas in Example 22.1, Group 1 (the group with the highest sum of ranks) has 
the best performance, in Example 22.2, Class 3 (the class with the lowest sum of ranks) is 
evaluated as the best class. This is the case, since in Example 22.2 the lower a student's rank the 
better the student, whereas in Example 22.1 the lower a subject's rank, the poorer the subject 
performed. 
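For readers who wish to verify the analysis of Example 22.2, the following Python sketch computes the mean ranks, the H statistic, and the tie correction factor directly from Dr. Root's rankings. Equations 22.1 and 22.3 are not reproduced in this section, so the code assumes their usual forms, H = 12/[N(N + 1)] Σ(R_j²/n_j) − 3(N + 1) and C = 1 − Σ(t³ − t)/(N³ − N); the function names are hypothetical.

```python
from collections import Counter

def kruskal_wallis_h(groups):
    """H computed from the sum of ranks in each group (assumed form of Equation 22.1)."""
    n_total = sum(len(g) for g in groups)
    weighted = sum(sum(g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

def tie_correction(groups):
    """C = 1 - sum(t^3 - t) / (N^3 - N), where t is the size of each set of tied ranks."""
    ranks = [r for g in groups for r in g]
    n_total = len(ranks)
    tied = sum(t ** 3 - t for t in Counter(ranks).values())
    return 1 - tied / (n_total ** 3 - n_total)

classes = [
    [9.5, 14.5, 12.5, 14.5, 12.5],   # Class 1
    [6, 9.5, 3, 9.5, 3],             # Class 2
    [1, 9.5, 6, 3, 6],               # Class 3
]
mean_ranks = [sum(g) / len(g) for g in classes]
h = kruskal_wallis_h(classes)
c = tie_correction(classes)
print(mean_ranks, h, round(c, 3))
```

The mean ranks come out as 12.7, 6.2, and 5.1, and H is 8.435 before rounding (reported as 8.44 for Example 22.1). The tie correction factor is approximately .964.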


Example 22.3 Doctor Radical, a math instructor at Logarithm University, has four classes in 
advanced calculus. There are six students in Class 1, seven students in Class 2, eight students 
in Class 3, and six students in Class 4. The instructor uses a programmed textbook in Class 1, 
a conventional textbook in Class 2, his own printed notes in Class 3, and no written instructional 
material in Class 4. At the end of the semester, in order to determine if the type of instruction 
employed influences student performance, Dr. Radical has another math instructor, Dr. Root, 
rank the 27 students in the four classes with respect to math ability. The rankings of the students 
in the four classes follow: Class 1: 1, 2, 4, 6, 8, 9; Class 2: 10, 14, 18, 20, 21, 25, 26; Class 3: 
3, 5, 7, 11, 12, 16, 17, 22; Class 4: 13, 15, 19, 23, 24, 27 (assume the lower the rank, the better 
the student). 


Example 22.3 provides us with the following information: n_1 = 6, n_2 = 7, n_3 = 8, 
n_4 = 6, and N = 27. The sums of the ranks for the four groups are: ΣR_1 = 30, ΣR_2 = 134, 
ΣR_3 = 93, ΣR_4 = 121. Substituting the appropriate values in Equation 22.1, the value 
H = 14.99 is computed. 

Employing Equation 22.2, df = 4 − 1 = 3. For df = 3, the tabled critical .05 and .01 
chi-square values are χ²_.05 = 7.81 and χ²_.01 = 11.34. Since the obtained value H = 14.99 is greater 
than both of the aforementioned critical values, the null hypothesis can be rejected at both the 
.05 and .01 levels. Thus, one can conclude that the rankings for at least two of the four classes 
differed significantly from one another. Although multiple comparisons will not be conducted 
for this example, visual inspection of the data suggests that the rankings for Class 1 are 
dramatically superior to those for the other three classes. 
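The value H = 14.99 can be checked from the sums of ranks and sample sizes alone. The sketch below assumes the usual rank-sum form of Equation 22.1 (which is not reproduced in this section) and compares the result with the tabled critical values quoted above; the variable names are illustrative.

```python
rank_sums = [30, 134, 93, 121]     # sums of ranks for Classes 1 through 4
sizes = [6, 7, 8, 6]
N = sum(sizes)

# Sanity check: the ranks 1..N must sum to N(N + 1)/2.
assert sum(rank_sums) == N * (N + 1) // 2

H = 12 / (N * (N + 1)) * sum(r ** 2 / n for r, n in zip(rank_sums, sizes)) - 3 * (N + 1)
df = len(sizes) - 1

critical = {0.05: 7.81, 0.01: 11.34}   # tabled chi-square values for df = 3
reject = {alpha: H > value for alpha, value in critical.items()}
print(round(H, 2), df, reject)
```

The computed H exceeds both critical values, so the null hypothesis is rejected at both the .05 and .01 levels.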


Endnotes 


1. Although it is possible to conduct a directional analysis, such an analysis will not be 
described with respect to the Kruskal-Wallis one-way analysis of variance by ranks. A 
discussion of a directional analysis when k = 2 can be found under the Mann-Whitney U 
test. A discussion of the evaluation of a directional alternative hypothesis when k ≥ 3 can 
be found in Section VII of the chi-square goodness-of-fit test (Test 8). Although the latter 
discussion is in reference to analysis of a k independent samples design involving categorical 
data, the general principles regarding analysis of a directional alternative hypothesis when 
k ≥ 3 are applicable to the Kruskal-Wallis one-way analysis of variance by ranks. 


2. As noted in Section IV, the chi-square distribution provides an approximation of the 
Kruskal-Wallis test statistic. Although the chi-square distribution provides an excellent 
approximation of the Kruskal-Wallis sampling distribution, some sources recommend the 
use of exact probabilities for small sample sizes. Exact tables of the Kruskal-Wallis 
distribution are discussed in Section VII. 


3. In the discussion of comparisons under the single-factor between-subjects analysis of 
variance, it is noted that a simple (also known as a pairwise) comparison is a comparison 
between any two groups in a set of k groups. 


4. Note that in Equation 22.5, as the value of N increases, the value computed for CD_KW will 
also increase because of the greater number (range of values) of rank-orderings required 
for the data. 


5. The rationale for the use of the proportions .0167 and .0083 in determining the appropriate 
value for z_adj is as follows. In the case of a one-tailed/directional analysis, the relevant 
probability/proportion employed is based on only one of the two tails of the normal 
distribution. Consequently, the proportion of the normal curve that is used to determine the 
value of z_adj will be a proportion that is equal to the value of α_PC in the appropriate tail 
of the distribution (which is designated in the alternative hypothesis). The value z = 2.13 
is employed, since the proportion of cases that falls above z = 2.13 in the right tail of the 
distribution is .0167, and the proportion of cases that falls below z = −2.13 in the left tail 
of the distribution is .0167. In the case of a two-tailed/nondirectional analysis, the relevant 
probability/proportion employed is based on both tails of the distribution. Consequently, 
the proportion of the normal curve that is used to determine the value of z_adj will be a 
proportion that is equal to the value of α_PC/2 in each tail of the distribution. The 
proportion α_PC/2 = .0167/2 = .0083 is employed for a two-tailed/nondirectional analysis, 
since one-half of the proportion that comprises α_PC = .0167 comes from the left tail of the 
distribution and the other half from the right tail. Consequently, the value z = 2.39 is 
employed, since the proportion of cases that falls above z = 2.39 in the right tail of the 
distribution is .0083, and the proportion of cases that falls below z = −2.39 in the left tail 
of the distribution is .0083. 


6. It should be noted that when a directional alternative hypothesis is employed, the sign of 
the difference between the two mean ranks must be consistent with the prediction stated in 
the directional alternative hypothesis. When a nondirectional alternative hypothesis is em- 
ployed, the direction of the difference between two mean ranks is irrelevant. 


7. a) Many researchers would probably be willing to tolerate a somewhat higher familywise 
Type I error rate than .05. In such a case the difference |R̄_1 − R̄_2| = 6.5 will be 
significant, since the value of z_adj employed in Equation 22.5 will be less than z = 2.39, 
thus resulting in a lower value for CD_KW. 

b) When there are a large number of ties in the data, a modified version of Equation 22.5 
is recommended by some sources (e.g., Daniel (1990)), which reduces the value of CD_KW 
by a minimal amount. Marascuilo and McSweeney (1977, p. 318) recommend that when 
ties are present in the data, the tie correction factor C = .964 computed with Equation 22.3 
be multiplied by the term in the radical of Equation 22.5. When the latter is done with the 
data for Example 22.1 (as noted below), the value CD_KW = 6.64 is obtained. As is the 
case with Equation 22.5, only the Group 1 versus Group 3 pairwise difference is significant. 


$$CD_{KW} = (2.39) \sqrt{(.964)\frac{(15)(15+1)}{12}\left(\frac{1}{5}+\frac{1}{5}\right)} = 6.64$$


c) In contrast to Equation 22.5, Conover (1980, 1999) employs Equation 22.6 for 
conducting pairwise comparisons. Conover (1980, 1999) states that Equation 22.6 is 
recommended when there are no ties. Although he provides a tie correction equation, he 
notes that the result obtained with the tie correction equation will be very close to that 
obtained with Equation 22.6 when there are only a few ties in the data. Because of the 
latter, Equation 22.6 will be employed in the discussion to follow. The latter equation can 
yield a substantially lower CD_KW value than Equation 22.5. Equation 22.6 is analogous to 
Equation 23.4, which Conover (1980, 1999) recommends for conducting comparisons for 
the van der Waerden normal-scores test for k independent samples (which is discussed 
in the next chapter). Note that if the element [(N − 1 − H)/(N − k)] is omitted from the 
radical in Equation 22.6, it becomes Equation 22.5. It is demonstrated below that when 
Equation 22.6 is employed with the data for Example 22.1, the value CD_KW = 4.60 is 
obtained. If the latter CD_KW value is used, both the Group 1 versus Group 3 and Group 1 
versus Group 2 comparisons are significant (the latter being the case, since the obtained 
difference of 6.5 for the Group 1 versus Group 2 comparison is greater than CD_KW = 4.60). 

Although Equation 22.6 allows for a more powerful test of an alternative hypothesis 
than Equation 22.5, some sources would argue that it does not adequately control the 
familywise Type I error rate. Hollander and Wolfe (1999) discuss alternative approaches 
for conducting pairwise comparisons for the Kruskal-Wallis test. 

In view of the fact that sources do not agree on the methodology for conducting 
pairwise comparisons, if two or more methods yield dramatically different results for a 
specific comparison, one or more replication studies employing reasonably large sample 
sizes should clarify whether or not an obtained difference is reliable, as well as the 
magnitude of the difference (if, in fact, one exists). 
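The three critical differences discussed in this endnote can be compared side by side with a short Python sketch. Equation 22.6 is not reproduced here, so its form is inferred from the statement that it reduces to Equation 22.5 when the element (N − 1 − H)/(N − k) is dropped from the radical, and the same value z_adj = 2.39 is assumed for all three computations. Both points are assumptions of this sketch rather than a restatement of Conover's equation.

```python
import math

z_adj, N, k, n_a, n_b = 2.39, 15, 3, 5, 5
H, C = 8.44, 0.964                               # values reported for Example 22.1

radical = N * (N + 1) / 12 * (1 / n_a + 1 / n_b)    # term in the radical of Equation 22.5

cd_equation_22_5 = z_adj * math.sqrt(radical)
cd_tie_corrected = z_adj * math.sqrt(C * radical)                     # endnote 7b
cd_inferred_22_6 = z_adj * math.sqrt(radical * (N - 1 - H) / (N - k)) # endnote 7c, inferred

for label, cd in [("Equation 22.5", cd_equation_22_5),
                  ("tie-corrected", cd_tie_corrected),
                  ("Equation 22.6 (inferred)", cd_inferred_22_6)]:
    print(label, round(cd, 2), 6.5 >= cd, 7.6 >= cd)
```

Under these assumptions the sketch reproduces the values 6.76, 6.64, and 4.60, and only the last of these is small enough for the Group 1 versus Group 2 difference of 6.5 to be declared significant.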


8. In Equation 22.5 the value z = 1.96 is employed for z_adj, and the latter value is 
multiplied by 2.83, which is the value computed for the term in the radical of the equation for 
Example 22.1. 


9. The slight discrepancy is due to rounding off error, since the actual absolute value of z 
computed with Equation 12.4 is 1.7756. 


10. In accordance with one of the assumptions noted in Section I for the Kruskal-Wallis 
one-way analysis of variance by ranks, in both Examples 22.2 and 22.3 it is assumed that Dr. 
Radical implicitly or explicitly evaluates the N students on a continuous interval/ratio scale 
prior to converting the data into a rank-order format. 


Test 23 


The van der Waerden Normal-Scores Test for k Independent Samples 
(Nonparametric Test Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test Are k independent samples derived from identical popula- 
tion distributions? 


Relevant background information on test The van der Waerden normal-scores test for k 
independent samples (van der Waerden (1952/1953)) is employed with ordinal (rank-order) data 
in a hypothesis testing situation involving a design with two or more independent samples. The 
van der Waerden test is one of a number of normal-scores tests that have been developed for 
evaluating data. Normal-scores tests transform a set of rank-orders into a set of standard 
deviation scores (i.e., z scores) based on the standard normal distribution. Marascuilo and 
McSweeney (1977, p. 280) note that normal-scores tests are often described as distribution free 
tests, insofar as the shape of the underlying population distribution(s) for the original data has 
little effect on the results of such tests. Because of their minimal assumptions, normal-scores 
tests are categorized as nonparametric tests. 

Just as converting a set of interval/ratio data into rank-orders transforms the data into what 
is essentially a uniform distribution, transforming a set of ranks into normal-scores transforms 
the ranks into essentially a normal distribution. Although the normal-scores transformation 
results in some loss of the information contained in the original data, much of the information 
is retained, albeit in a different format. Conover (1980, 1999) notes that when the underlying 
population distributions for the original data are normal, by virtue of employing a normal-scores 
transformation, the statistical power of the resulting normal-scores test will generally be equal 
to that of the analogous parametric test, and that when the underlying population distributions 
for the original data are not normal, the power of the normal-scores test may actually be higher 
than that of the analogous parametric test. 
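The idea of converting ranks into normal scores can be illustrated with a minimal Python sketch. The specific rule used here, in which the normal score for rank r is the z value below which a proportion r/(N + 1) of the standard normal distribution falls, is the quantile rule commonly associated with the van der Waerden test; it is stated as an assumption, since the chapter's own computational equations are not part of this passage. The function name `normal_scores` is hypothetical.

```python
from statistics import NormalDist

def normal_scores(ranks, n_total):
    """Map each rank r to the z value whose lower-tail proportion is r / (N + 1) (assumed rule)."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(r / (n_total + 1)) for r in ranks]

N = 9
scores = normal_scores(range(1, N + 1), N)
print([round(z, 2) for z in scores])
```

The evenly spaced ranks become scores that are symmetric about zero and more widely spaced in the tails, which is the sense in which the transformation yields what is essentially a normal distribution.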

If the result of the van der Waerden normal-scores test for k independent samples is 
significant, it indicates there is a significant difference between at least two of the samples in the 
set of k samples. Consequently, the researcher can conclude there is a high likelihood that at 
least two of the k samples are not derived from the same population. Because of the latter, there 
is a high likelihood that the magnitude of the scores in one distribution is greater than the 
magnitude of the scores in the other distribution. 

As is the case with the Kruskal-Wallis one-way analysis of variance by ranks (Test 22), 
in employing the van der Waerden normal-scores test for k independent samples, one of the 
following is true with regard to the rank-order data that are evaluated: a) The data are in a rank- 
order format, since it is the only format in which scores are available; or b) The data have been 
transformed into a rank-order format from an interval/ratio format, since the researcher has reason 
to believe that one or more of the assumptions of the single-factor between-subjects analysis 
of variance (Test 21) (which is the parametric analog of the Kruskal-Wallis test and the van 
der Waerden test) are saliently violated. It should be noted that when a researcher elects to 
transform a set of interval/ratio data into ranks, information is sacrificed. This latter fact accounts 
for why there is reluctance among some researchers to employ nonparametric tests such as the 
Kruskal-Wallis one-way analysis of variance by ranks and the van der Waerden normal- 
scores test for k independent samples, even if there is reason to believe that one or more of the 
assumptions of the single-factor between-subjects analysis of variance have been violated. 

Conover (1980, 1999) notes that the van der Waerden normal-scores test for k inde- 
pendent samples is based on the same assumptions as the Kruskal-Wallis one-way analysis 
of variance by ranks, which are as follows: a) Each sample has been randomly selected from 
the population it represents; b) The k samples are independent of one another; c) The dependent 
variable (which is subsequently ranked) is a continuous random variable. In truth, this assump- 
tion, which is common to many nonparametric tests, is often not adhered to, in that such tests are 
often employed with a dependent variable that represents a discrete random variable; and d) The 
underlying distributions from which the samples are derived are identical in shape. The shapes 
of the underlying population distributions, however, do not have to be normal. 


II. Example 


Example 23.1 is identical to Examples 21.1/22.1 (which are respectively evaluated with the 
single-factor between-subjects analysis of variance and the Kruskal-Wallis one-way analysis 
of variance by ranks). In evaluating Example 23.1 it will be assumed that the ratio data (i.e., 
the number of nonsense syllables correctly recalled) are rank-ordered, since one or more of the 
assumptions of the single-factor between-subjects analysis of variance have been saliently 
violated. 


Example 23.1 A psychologist conducts a study to determine whether or not noise can inhibit 
learning. Each of 15 subjects is randomly assigned to one of three groups. Each subject is given 
20 minutes to memorize a list of 10 nonsense syllables which she is told she will be tested on the 
following day. The five subjects assigned to Group 1, the no noise condition, study the list of 
nonsense syllables while they are in a quiet room. The five subjects assigned to Group 2, the 
moderate noise condition, study the list of nonsense syllables while listening to classical music. 
The five subjects assigned to Group 3, the extreme noise condition, study the list of nonsense 
syllables while listening to rock music. The number of nonsense syllables correctly recalled by 
the 15 subjects follows: Group 1: 8, 10, 9, 10, 9; Group 2: 7, 8, 5, 8, 5; Group 3: 4, 8, 7, 5, 7. 
Do the data indicate that noise influenced subjects’ performance? 


III. Null versus Alternative Hypotheses 


Null hypothesis $H_0$: The k = 3 groups are derived from the same population. 


(If the null hypothesis is supported, the averages of the normal-scores for each of the k = 3 groups 
will be equal (i.e., $\bar{z}_1 = \bar{z}_2 = \bar{z}_3$). If the latter is true, it indicates that the three underlying 
populations are equivalent with respect to the magnitude of the scores in each of the distributions.) 


Alternative hypothesis $H_1$: At least two of the k = 3 groups are not derived from the same 
population. 


(If the alternative hypothesis is supported, the averages of the normal-scores computed for at least 
two of the k = 3 groups will not be equal to one another. It is important to note that the alternative 
hypothesis does not require that $\bar{z}_1 \neq \bar{z}_2 \neq \bar{z}_3$, since the latter implies that all three normal-scores 
means must differ from one another. If the alternative hypothesis is supported, it indicates that 
at least two of the three populations are not equivalent with respect to the magnitude of the scores 
in each of the distributions. In this book it will be assumed (unless stated otherwise) that the 
alternative hypothesis for the van der Waerden normal-scores test for k independent samples 
is stated nondirectionally.) 


IV. Test Computations 


The total number of subjects employed in the experiment is N = 15, and there are $n_j = n_1 
= n_2 = n_3 = 5$ subjects in each group. The use of the van der Waerden normal-scores test 
for k independent samples assumes that the ratio scores of the N = 15 subjects have been rank- 
ordered in accordance with the rank-ordering procedure described for the Kruskal-Wallis one- 
way analysis of variance by ranks (the aforementioned rank-ordering procedure is described 
in Section IV of the latter test). Table 23.1 summarizes the data for Example 23.1. The table 
lists the following values for each of the $n_j = 5$ subjects in the k = 3 groups (where j indicates 
the jth group): a) The original ratio scores of the subjects (i.e., $X_j$, the number of nonsense 
syllables correctly recalled); b) The rank-order of each score ($R_j$) within the framework of the 
rank-ordering procedure described for the Kruskal-Wallis one-way analysis of variance by 
ranks; and c) The normal score value ($z_j$) computed for each rank-order. 


Table 23.1 Summary of Data for Example 23.1 


        Group 1                    Group 2                    Group 3
  $X_1$   $R_1$   $z_1$      $X_2$   $R_2$   $z_2$      $X_3$   $R_3$   $z_3$
    8      9.5     .24         7      6      -.32         4      1     -1.53
   10     14.5    1.32         8      9.5     .24         8      9.5     .24
    9     12.5     .78         5      3      -.89         7      6      -.32
   10     14.5    1.32         8      9.5     .24         5      3      -.89
    9     12.5     .78         5      3      -.89         7      6      -.32
  $\sum z_1 = 4.44$          $\sum z_2 = -1.62$         $\sum z_3 = -2.82$
  $\bar{z}_1 = .888$         $\bar{z}_2 = -.324$        $\bar{z}_3 = -.564$
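
The rank-orders in Table 23.1 can be reproduced with a short routine. The following is a minimal sketch, assuming (consistent with the values in the table) that tied scores are each assigned the average of the ordinal positions they occupy; the function name `average_ranks` is hypothetical and is introduced only for illustration.

```python
def average_ranks(scores):
    """Return tie-averaged ranks (1 = lowest score), in the original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        # Tied scores share the mean of the ordinal positions they occupy.
        shared = ((start + 1) + (end + 1)) / 2
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


# Scores from Example 23.1, pooled across the three groups before ranking.
groups = {1: [8, 10, 9, 10, 9], 2: [7, 8, 5, 8, 5], 3: [4, 8, 7, 5, 7]}
pooled = [x for g in groups.values() for x in g]
pooled_ranks = average_ranks(pooled)
print(pooled_ranks[:5])    # Group 1: [9.5, 14.5, 12.5, 14.5, 12.5]
print(pooled_ranks[5:10])  # Group 2: [6.0, 9.5, 3.0, 9.5, 3.0]
print(pooled_ranks[10:])   # Group 3: [1.0, 9.5, 6.0, 3.0, 6.0]
```

Note that all N = 15 scores are ranked together, and only afterward are the ranks separated by group.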


As noted above, in order to conduct the van der Waerden normal-scores test for k in- 
dependent samples, each of the rank-orders must be converted into a normal score. Since the 
latter conversion is based on cumulative proportions (or percentiles) for the normal distribution, 
at this point in the discussion the reader may want to review the following material: a) The 
material in the Introduction on percentiles and cumulative proportions (within the context of a 
cumulative frequency distribution), and the material on the normal distribution; and b) The 
material on cumulative proportions in Section I of the Kolmogorov-Smirnov goodness-of-fit 
test for a single sample. 

The following protocol is employed to convert a rank-order into a normal score: Divide 
the rank-order by the value (N + 1). The resulting value will be a proportion (which will be 
designated as P) that is greater than 0 but less than 1. The proportion that is computed is con- 
ceptualized as the percentile for that score (when the decimal point is moved two places to the 
right). The standard normal score (i.e., z value) which corresponds to that percentile will 
represent the normal score for that rank-order. The latter z value is obtained through use of 
Table A1 (Table of the Normal Distribution) in the Appendix. 

To illustrate, the computation of the normal score of z = .24 for Subject 1 in Group 1 will 
now be explained. The original score for the subject in question is 8, which is assigned a rank-order 
of 9.5 within the framework of the Kruskal-Wallis ranking procedure. Employing the 
procedure described above for converting a rank-order into a proportion, the rank-order 9.5 is 
divided by N + 1 = 15 + 1 = 16. Thus, P = 9.5/16 = .5938. The latter proportion is conceptualized 
as a cumulative proportion or percentile in a normal distribution (by moving the 
decimal point two places to the right, the proportion .5938 is converted into 59.38%). This result 
tells us that the normal score for the rank-order 9.5 will be the z value that corresponds to the 
cumulative proportion .5938 in the normal distribution (or expressed within the framework of 
a percentile, the z value that falls at the 59.38th percentile). The point in question corresponds 
to the point on the normal distribution below which 59.38% of the cases fall (or to state it 
proportionally, .5938 is the proportion of cases that fall below that point). Since the latter 
proportion/percentile is above .50/50% (i.e. above the mean of the normal distribution), the sign 
of the z value will be positive. In order to identify the appropriate z value, we look in Column 
2 of Table A1 for the value that is closest to .5938 - .5000 = .0938. The latter value is 
z = .24, since the entry .0948 in Column 2 for z = .24 is closest to .0938.³ 

The general rule for determining the normal score for any proportion (P) that is greater than 
.5000 is to find in Column 2 of Table A1 the proportion (which we will designate as Q) that is 
closest to the difference between the value of P and .5000 (i.e., P - .5000 = Q). The z value in 
the row that corresponds to the value of Q will be the normal score for the proportion P. 

The computation of a normal score for a proportion that is less than .5000 will now be 
explained. To demonstrate this we will use the normal score of z = -1.53 for Subject 1 in Group 
3. The original score for the subject in question is 4, which is assigned a rank-order of 1 within 
the framework of the Kruskal-Wallis ranking procedure. Employing the protocol for 
converting a rank-order into a proportion, the rank-order 1 is divided by N + 1 = 15 + 1 = 16. 
Thus, P = 1/16 = .0625. The latter proportion is conceptualized as a cumulative proportion or 
percentile in a normal distribution (by moving the decimal point two places to the right, the 
proportion .0625 is converted into 6.25%). This result tells us that the normal score for the rank- 
order 1 will be the z value that corresponds to the cumulative proportion .0625 in the normal 
distribution (or expressed within the framework of a percentile, the z value that falls at the 6.25th 
percentile). The point in question corresponds to the point on the normal distribution below 
which 6.25% of the cases fall (or to state it proportionally, .0625 is the proportion of cases that 
fall below that point). Since the latter proportion/percentile is below .50/50% (i.e. below the 
mean of the normal distribution), the sign of the z value will be negative. In order to identify 
the appropriate z value, we look in Column 3 of Table A1 for the value that is closest to .0625. 
The latter value is z = 1.53, since the entry .0630 in Column 3 for z = 1.53 is closest to .0625. 

The general rule for determining the normal score for any proportion (P) that is less than 
.5000 is to find in Column 3 of Table A1 the proportion that is closest to the value of P. The 
z value in the row that corresponds to the value of P will be the normal score for the proportion 
P. A negative sign is assigned to the obtained z value. 
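
The conversion of a rank-order into a normal score can also be carried out without a printed table. The sketch below assumes that the inverse of the standard normal cumulative distribution (here `NormalDist().inv_cdf` from the Python standard library) is an acceptable substitute for the lookup in Table A1; the function name `normal_score` is hypothetical. Because the inverse function returns a signed value, the separate rules for proportions above and below .5000 are not needed.

```python
from statistics import NormalDist


def normal_score(rank, n_total):
    """Convert a rank-order to a normal score: the z value at P = rank/(N + 1)."""
    proportion = rank / (n_total + 1)  # always greater than 0 and less than 1
    return NormalDist().inv_cdf(proportion)


# The two worked cases from Example 23.1 (N = 15).
z_rank_9_5 = round(normal_score(9.5, 15), 2)  # P = 9.5/16 = .5938
z_rank_1 = round(normal_score(1, 15), 2)      # P = 1/16 = .0625
print(z_rank_9_5)  # 0.24
print(z_rank_1)    # -1.53
```

Rounding to two decimal places mirrors the precision of the tabled z values; unrounded scores will differ slightly in the third decimal place.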

After computing a normal score for each of the 15 rank-orders in Table 23.1, the sum of the 
normal-scores is computed for each group. The latter values in Table 23.1 are $\sum z_1 = 4.44$, 
$\sum z_2 = -1.62$, $\sum z_3 = -2.82$. An average normal score for each group is computed by dividing 
the sum of the normal-scores for the group by the number of subjects in the group. Thus: 
$\bar{z}_1 = \sum z_1/n_1 = 4.44/5 = .888$, $\bar{z}_2 = \sum z_2/n_2 = -1.62/5 = -.324$, and $\bar{z}_3 = \sum z_3/n_3 = -2.82/5 
= -.564$. In order to compute the test statistic for the van der Waerden normal-scores test for 
k independent samples, it is necessary to compute the estimated population variance ($\tilde{s}^2$) of 
the normal-scores. The latter is computed with Equation 23.1. 


$$\tilde{s}^2 = \frac{\sum_{j=1}^{k}\sum_{i=1}^{n_j} z_{ij}^2}{N - 1} \qquad \text{(Equation 23.1)}$$


The notation in Equation 23.1 indicates the following: a) If we let $z_{ij}$ represent the normal 
score for the ith subject in Group j, then $\sum_{j=1}^{k}\sum_{i=1}^{n_j} z_{ij}^2$ indicates that each of the N (i.e., $(k)(n_j) = N$) 
normal-scores is squared, and the N squared normal-scores are summed; and b) The sum of the 
squared normal-scores is divided by (N - 1), which yields the value of the variance. Employing 
Equation 23.1, the value $\tilde{s}^2 = .7482$ is computed. Conover (1980, 1999) notes that the 
computed value for the variance will generally be close to 1.⁴ 


$$\tilde{s}^2 = \frac{\sum_{j=1}^{k}\sum_{i=1}^{n_j} z_{ij}^2}{15 - 1} = .7482$$


The chi-square distribution provides an excellent estimate of the exact probability distribution 
for the van der Waerden test statistic. The chi-square estimate of the test statistic (to 
be designated $\chi^2_{vdw}$) is computed with Equation 23.2. 


$$\chi^2_{vdw} = \frac{\sum_{j=1}^{k} n_j \bar{z}_j^2}{\tilde{s}^2} \qquad \text{(Equation 23.2)}$$


The notation in Equation 23.2 indicates the following: a) In the numerator of the equation, 
the number of subjects in each group ($n_j$) is multiplied by the square of the mean of the normal-scores 
for that group. Upon doing the latter for all k groups, the k resulting values are summed; 
and b) The final sum obtained in part a) is divided by the estimated population variance 
computed with Equation 23.1. 

When Equation 23.2 is employed to compute the test statistic for the van der Waerden 
normal-scores test for k independent samples, the value $\chi^2_{vdw} = 8.10$ is obtained. 


$$\chi^2_{vdw} = \frac{(5)(.888)^2 + (5)(-.324)^2 + (5)(-.564)^2}{.7482} = 8.10$$
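
The arithmetic of Equation 23.2 can be checked with a few lines of code. The sketch below takes the group sizes, the mean normal-scores, and the variance estimate reported above as given inputs rather than recomputing them; the function name `vdw_statistic` is hypothetical.

```python
def vdw_statistic(group_sizes, mean_scores, variance):
    """Equation 23.2: sum of n_j times the squared mean normal score, over the variance."""
    numerator = sum(n * mean ** 2 for n, mean in zip(group_sizes, mean_scores))
    return numerator / variance


# Values reported for Example 23.1.
chi2_vdw = vdw_statistic([5, 5, 5], [0.888, -0.324, -0.564], 0.7482)
print(f"{chi2_vdw:.2f}")  # 8.10
```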


V. Interpretation of the Test Results 


In order to reject the null hypothesis the computed value $\chi^2_{vdw}$ must be equal to or greater than 
the tabled critical chi-square value at the prespecified level of significance. The computed chi-square 
value is evaluated with Table A4 (Table of the Chi-Square Distribution) in the Appendix. 
For the appropriate degrees of freedom, the tabled $\chi^2_{.95}$ value (which is the chi-square 
value at the 95th percentile) and the tabled $\chi^2_{.99}$ value (which is the chi-square value at the 99th 
percentile) are employed as the .05 and .01 critical values for evaluating a nondirectional 
alternative hypothesis. The number of degrees of freedom employed in the analysis is 
computed with Equation 23.3. Thus, df = 3 - 1 = 2. 


$$df = k - 1 \qquad \text{(Equation 23.3)}$$


For df = 2, the tabled critical .05 and .01 chi-square values are $\chi^2_{.05} = 5.99$ and 
$\chi^2_{.01} = 9.21$. Since the computed value $\chi^2_{vdw} = 8.10$ is greater than $\chi^2_{.05} = 5.99$, the alternative 
hypothesis is supported at the .05 level. Since, however, $\chi^2_{vdw} = 8.10$ is less than $\chi^2_{.01} = 9.21$, 
the alternative hypothesis is not supported at the .01 level. A summary of the analysis of 
Example 23.1 with the van der Waerden normal-scores test for k independent samples 
follows: It can be concluded that there is a significant difference between at least two of the 
three groups exposed to different levels of noise. This result can be summarized as follows: 
$\chi^2(2) = 8.10$, p < .05. 
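
For the special case of df = 2, the tabled critical values can be verified without a table, since the chi-square distribution with two degrees of freedom has the closed-form upper-tail probability $e^{-x/2}$. The following sketch relies on that identity only; it does not generalize to other degrees of freedom, and the function names are hypothetical.

```python
import math


def chi2_upper_tail_df2(x):
    """Proportion of the chi-square distribution (df = 2) falling above x."""
    return math.exp(-x / 2)


def chi2_critical_df2(alpha):
    """Chi-square value (df = 2) above which a proportion alpha of cases falls."""
    return -2 * math.log(alpha)


print(f"{chi2_critical_df2(0.05):.2f}")    # 5.99
print(f"{chi2_critical_df2(0.01):.2f}")    # 9.21
p_value = chi2_upper_tail_df2(8.10)
print(f"{p_value:.4f}")                    # 0.0174
```

The probability obtained for the computed value 8.10 falls between .01 and .05, which agrees with the decision reached through use of Table A4.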

It should be noted that when the data for Example 23.1 are evaluated with the Kruskal-Wallis 
one-way analysis of variance by ranks, the identical conclusions are reached (i.e., the 
null hypothesis can be rejected at the .05 level, but not at the .01 level, although the Kruskal-Wallis 
test results in a slightly larger chi-square value). When the data are evaluated with the 
single-factor between-subjects analysis of variance, the null hypothesis can be rejected at both 
the .05 and .01 levels (although it barely achieves significance at the latter level). The slight 
discrepancy between the results of the van der Waerden test and the analysis of variance 
suggests that in the case of Examples 21.1/22.1/23.1, the analysis of 
variance provides a slightly more powerful test of the alternative hypothesis than either the van 
der Waerden or Kruskal-Wallis tests. 


VI. Additional Analytical Procedures for the van der Waerden 
Normal-Scores Test for k Independent Samples 


1. Pairwise comparisons following computation of the test statistic for the van der 
Waerden normal-scores test for k independent samples Prior to reading this section the 
reader should review the discussion of comparisons in Section VI of the single-factor between-subjects 
analysis of variance. As is the case with the omnibus F value computed for the single-factor 
between-subjects analysis of variance, the $\chi^2_{vdw}$ value computed with Equation 23.2 is 
based on an evaluation of all k groups. When the value of $\chi^2_{vdw}$ is significant, it does not indicate 
whether just two or, in fact, more than two groups differ significantly from one another. In order 
to answer the latter question, it is necessary to conduct comparisons contrasting specific groups 
with one another. This section will describe a methodology that can be employed for conducting 
simple/pairwise comparisons following the computation of a $\chi^2_{vdw}$ value.⁶ 

In conducting a simple comparison between any two groups to be designated a and b, the 
null hypothesis and nondirectional alternative hypothesis are as follows: $H_0$: Groups a and b 
are derived from identical population distributions; $H_1$: Groups a and b are not derived from 
identical population distributions. The alternative hypothesis can also be stated directionally 
insofar as the researcher can predict that the magnitude of the scores in one distribution is greater 
than the magnitude of the scores in the other distribution. As is the case with the omnibus null 
hypothesis, the decision made with regard to the null hypothesis for a comparison will be a 
function of the magnitude of the difference between the averages of the normal-scores for the 
groups that are involved in the comparison. 

Conover (1980, 1999) describes the use of Equation 23.4 to conduct comparisons for the 
van der Waerden normal-scores test for k independent samples. The latter equation allows 
a researcher to identify the minimum required difference between the means of the normal-scores 
of any two groups (designated as $CD_{vdw}$) in order for them to differ from one another at the 
prespecified level of significance. 


$$CD_{vdw} = t_{adj}\sqrt{\tilde{s}^2\left[\frac{N - 1 - \chi^2_{vdw}}{N - k}\right]}\sqrt{\frac{1}{n_a} + \frac{1}{n_b}} \qquad \text{(Equation 23.4)}$$

Where: $n_a$ and $n_b$ represent the number of subjects in each of the groups involved in the 
simple comparison 


The value of $t_{adj}$ is obtained from Table A2 (Table of Student's t Distribution) in 
the Appendix. In the case of a nondirectional alternative hypothesis, $t_{adj}$ is the t value for 
df = N - k, above which a proportion of cases corresponding to the value $\alpha_{FW}/2c$ falls (where 
c is the total number of comparisons that are conducted). In Example 23.1, df = 15 - 3 = 12. In 
the case of a directional alternative hypothesis, $t_{adj}$ is the t value above which a proportion of 
cases corresponding to the value $\alpha_{FW}/c$ falls. When all possible pairwise comparisons are made 
c = [k(k - 1)]/2, and thus, 2c = k(k - 1). In Example 23.1 the number of pairwise/simple 
comparisons that can be conducted is c = [3(3 - 1)]/2 = 3: specifically, Group 1 versus Group 
2, Group 1 versus Group 3, and Group 2 versus Group 3. 

The value of $t_{adj}$ will be a function of both the maximum familywise Type I error rate 
($\alpha_{FW}$) the researcher is willing to tolerate and the total number of comparisons that are conducted. 
When a limited number of comparisons are planned prior to collecting the data, most 
sources take the position that a researcher is not obliged to control the value of $\alpha_{FW}$. In such 
a case, the per comparison Type I error rate ($\alpha_{PC}$) will be equal to the prespecified value 
of alpha. When $\alpha_{FW}$ is not adjusted, the value of $t_{adj}$ employed in Equation 23.4 will be the 
tabled critical t value that corresponds to the prespecified level of significance. Thus, if a 
nondirectional alternative hypothesis is employed and $\alpha = \alpha_{PC} = .05$, for df = 12, the 
tabled critical two-tailed .05 value $t_{.05} = 2.179$ is used to represent $t_{adj}$ in Equation 23.4. If 
$\alpha = \alpha_{PC} = .01$, the tabled critical two-tailed .01 value $t_{.01} = 3.055$ is used in Equation 23.4. 
In the same respect, if a directional alternative hypothesis is employed, the tabled critical .05 and 
.01 one-tailed values $t_{.05} = 1.782$ and $t_{.01} = 2.681$ are used for $t_{adj}$ in Equation 23.4. 

When comparisons are not planned beforehand, it is generally acknowledged that the value 
of $\alpha_{FW}$ must be controlled so as not to become excessive. The general approach for controlling 
the latter value is to establish a per comparison Type I error rate which ensures that $\alpha_{FW}$ will 
not exceed some maximum value stipulated by the researcher. One method for doing this 
(described under the single-factor between-subjects analysis of variance as the Bonferroni-Dunn 
method) establishes the per comparison Type I error rate by dividing the maximum 
value one will tolerate for the familywise Type I error rate by the total number of comparisons 
conducted. Thus, in Example 23.1, if one intends to conduct all three pairwise comparisons and 
wants to ensure that $\alpha_{FW}$ does not exceed .05, $\alpha_{PC} = \alpha_{FW}/c = .05/3 = .0167$. The latter 
proportion is used to determine the value of $t_{adj}$.⁷ As noted earlier, if a directional alternative 
hypothesis is employed for a comparison, the value of $t_{adj}$ employed in Equation 23.4 is the t 
value above which a proportion equal to $\alpha_{PC} = \alpha_{FW}/c$ of the cases falls. In Table A2, by 
interpolation (since the exact value is not listed) the t value that corresponds approximately to 
the proportion .0167 (for df = 12) is t = 2.35. By employing $t_{adj} = 2.35$ in Equation 23.4, 
one can be assured that within the “family” of three pairwise comparisons, $\alpha_{FW}$ will not exceed 
.05 (assuming all of the comparisons are directional). If a nondirectional alternative hypothesis 
is employed for all of the comparisons, the value of $t_{adj}$ will be the t value above which a proportion 
equal to $\alpha_{FW}/2c = \alpha_{PC}/2$ of the cases falls. Since $\alpha_{PC}/2 = .0167/2 = .0083$, t = 2.75. 
By employing $t_{adj} = 2.75$ in Equation 23.4, one can be assured that $\alpha_{FW}$ will not exceed .05. 
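
The Bonferroni-Dunn proportions described above follow directly from the number of comparisons. The sketch below enumerates the pairwise comparisons for k = 3 groups and derives the upper-tail proportions used to locate the adjusted t value; the variable names are hypothetical, and the t values themselves are still read from Table A2 (no t distribution is computed here).

```python
from itertools import combinations

k = 3            # number of groups in Example 23.1
alpha_fw = 0.05  # maximum familywise Type I error rate the researcher will tolerate

# All possible simple comparisons: c = k(k - 1)/2.
pairs = list(combinations(range(1, k + 1), 2))
c = len(pairs)

alpha_pc = alpha_fw / c             # per comparison Type I error rate
tail_directional = alpha_pc         # proportion above t_adj, directional hypothesis
tail_nondirectional = alpha_pc / 2  # proportion above t_adj, nondirectional hypothesis

print(pairs)                          # [(1, 2), (1, 3), (2, 3)]
print(round(tail_directional, 4))     # 0.0167
print(round(tail_nondirectional, 4))  # 0.0083
```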

In order to employ the $CD_{vdw}$ value computed with Equation 23.4, it is necessary to 
determine the mean normal score for each of the k groups (which we have already done), and 
then compute the absolute value of the difference between the mean normal-scores of each pair 
of groups that are compared.⁸ In Table 23.1 the following values for the mean normal-scores 
of the groups are computed: $\bar{z}_1 = .888$, $\bar{z}_2 = -.324$, $\bar{z}_3 = -.564$. Employing the latter values, 
Table 23.2 summarizes the difference scores between pairs of mean normal-scores. 


Table 23.2 Difference Scores Between Pairs 
of Mean Normal-Scores for Example 23.1 


$|\bar{z}_1 - \bar{z}_2| = |.888 - (-.324)| = 1.212$
$|\bar{z}_1 - \bar{z}_3| = |.888 - (-.564)| = 1.452$
$|\bar{z}_2 - \bar{z}_3| = |(-.324) - (-.564)| = .240$


If any of the differences between mean normal-scores is equal to or greater than the 
$CD_{vdw}$ value computed with Equation 23.4, a comparison is declared significant. Equation 23.4 
will now be employed to evaluate the nondirectional alternative hypothesis for all three pairwise 
comparisons. Since it will be assumed that the comparisons are unplanned and that the 
researcher does not want the value of $\alpha_{FW}$ to exceed .05, the value $t_{adj} = 2.75$ will be used in 
computing $CD_{vdw}$: 


$$CD_{vdw} = (2.75)\sqrt{(.7482)\left[\frac{15 - 1 - 8.10}{15 - 3}\right]}\sqrt{\frac{1}{5} + \frac{1}{5}} = (2.75)(.3836) = 1.055$$


The obtained value $CD_{vdw} = 1.055$ indicates that any difference between the mean normal-scores 
of two groups that is equal to or greater than 1.055 is significant. With respect to the three 
pairwise comparisons, the differences $|\bar{z}_1 - \bar{z}_2| = 1.212$ and $|\bar{z}_1 - \bar{z}_3| = 1.452$ are greater 
than $CD_{vdw} = 1.055$, whereas the difference $|\bar{z}_2 - \bar{z}_3| = .240$ is not. Thus, we can conclude there is a significant difference between the 
performance of Group 1 and Group 2 and between the performance of Group 1 and Group 3.⁹ 
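
The comparison procedure can be summarized in code. The sketch below assumes the form of Equation 23.4 implied by the worked computation above (the variance estimate multiplied by the element (N - 1 - chi-square)/(N - k) under one radical, and the sum of the reciprocal sample sizes under a second), and it takes the reported values for Example 23.1 as inputs. The function name `critical_difference` is hypothetical.

```python
import math
from itertools import combinations


def critical_difference(t_adj, variance, chi2, n_total, k, n_a, n_b):
    """Minimum required difference between two mean normal-scores (Equation 23.4)."""
    scale = variance * (n_total - 1 - chi2) / (n_total - k)
    return t_adj * math.sqrt(scale) * math.sqrt(1 / n_a + 1 / n_b)


means = {1: 0.888, 2: -0.324, 3: -0.564}  # mean normal-scores from Table 23.1
cd = critical_difference(t_adj=2.75, variance=0.7482, chi2=8.10,
                         n_total=15, k=3, n_a=5, n_b=5)
print(round(cd, 3))  # 1.055

# A comparison is declared significant if |difference| >= CD.
decisions = {}
for a, b in combinations(means, 2):
    diff = abs(means[a] - means[b])
    decisions[(a, b)] = diff >= cd
    print(f"Group {a} vs Group {b}: {diff:.3f} significant={decisions[(a, b)]}")
```

Because all three groups have the same sample size, a single critical difference applies to every pair; with unequal sample sizes the value would have to be recomputed for each comparison.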

The same logic employed for computing a confidence interval for a comparison described 
in Section VI of the single-factor between-subjects analysis of variance can be employed to 
compute a confidence interval for the van der Waerden test model. Specifically: Add the 
computed value of $CD_{vdw}$ to, and subtract it from, the obtained difference between the two mean 
normal-scores involved in a comparison. Thus, $CI_{.95}$ (based on $\alpha_{FW} = .05$) for the comparison 
Group 1 versus Group 2 is computed as follows: $CI_{.95} = 1.212 \pm 1.055$. In other words, the 
researcher can be 95% confident (or the probability is .95) that the mean of the normal-scores 
in the population represented by Group 1 is between .157 and 2.267 standard deviation units 
larger than the mean of the normal-scores in the population represented by Group 2. 
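
The limits of the interval can be confirmed with the arithmetic below, which is a minimal sketch employing the difference score and critical difference reported above as inputs (the variable names are hypothetical).

```python
difference = 1.212  # |mean normal score of Group 1 - mean normal score of Group 2|
cd_vdw = 1.055      # critical difference computed with Equation 23.4

lower_limit = round(difference - cd_vdw, 3)
upper_limit = round(difference + cd_vdw, 3)
print(lower_limit, upper_limit)  # 0.157 2.267
```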


VII. Additional Discussion of the van der Waerden Normal-Scores 
Test for k Independent Samples 


1. Alternative normal-scores tests Alternative normal-scores procedures have been developed 
by Terry and Hoeffding (Terry, 1952) and Bell and Doksum (1965). The latter procedures are 
described in Marascuilo and McSweeney (1977), who also discuss the extension of normal- 
scores tests to other experimental designs (e.g., repeated measures/matched-samples designs). 
The alternative normal-scores tests employ a different procedure from the one described for the 
van der Waerden test for determining a normal score for each of the rank-orders in a sample. 
For example, the Bell-Doksum normal-scores test (1965) obtains N random normal deviates 
(which are typically generated with a pseudorandom number generator, which is discussed in 
Section IX (the Addendum) of the single-sample runs test (Test 10)), ordinally arranges the 
N random normal deviates (i.e., the N z scores), and then pairs each of the random deviates (z 
scores) with the rank-order in the same ordinal position. 
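
The pairing step of the Bell-Doksum procedure can be illustrated as follows. This is only a sketch of the description given above, under the simplifying assumption that there are no tied ranks (so that each rank-order from 1 to N corresponds to exactly one ordinal position); the function name `bell_doksum_scores` and the use of a seeded generator are hypothetical choices made for illustration.

```python
import random


def bell_doksum_scores(n_total, seed=None):
    """Pair each rank-order 1..N with the random normal deviate in the same ordinal position."""
    rng = random.Random(seed)
    # Obtain N random normal deviates and arrange them ordinally (lowest to highest).
    deviates = sorted(rng.gauss(0.0, 1.0) for _ in range(n_total))
    # Rank-order r is assigned the r-th smallest deviate.
    return {rank: z for rank, z in enumerate(deviates, start=1)}


scores = bell_doksum_scores(15, seed=1)
print(scores[1] < scores[8] < scores[15])  # True: higher ranks receive higher deviates
```

Unlike the van der Waerden normal-scores, the scores obtained in this manner will differ each time a new set of random deviates is generated.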


VIII. Additional Examples Illustrating the van der Waerden 
Normal-Scores Test for k Independent Samples 


The van der Waerden normal-scores test for k independent samples can be employed to 
evaluate any of the additional examples noted for the single-factor between-subjects analysis 
of variance, if the data for the latter examples are rank-ordered, and the rank-orders are then 
converted into normal-scores. The van der Waerden test can also be employed to evaluate 
Examples 22.2 and 22.3, which are presented in Section VIII of the Kruskal-Wallis one-way 
analysis of variance by ranks. 

As noted earlier, the van der Waerden normal-scores test for k independent samples 
can be employed to evaluate a design involving k = 2 independent samples/groups. Thus, the test 
can be used to evaluate the data for any of the examples noted for the t test for two independent 
samples (Test 11) and the Mann-Whitney U test (Test 12) (assuming the interval/ratio scores 
are rank-ordered, and the ranks are then converted into normal-scores). Example 23.2 is identical 
to Examples 11.1/12.1, which are employed to illustrate the use of the t test for two independent 
samples and the Mann-Whitney U test. The example will be employed to illustrate the use of 
the van der Waerden normal-scores test for k independent samples when there are k = 2 
independent samples/groups. 


Example 23.2 In order to assess the efficacy of a new antidepressant drug, ten clinically 
depressed patients are randomly assigned to one of two groups. Five patients are assigned to 
Group 1, which is administered the antidepressant drug for a period of six months. The other 
five patients are assigned to Group 2, which is administered a placebo during the same six-month 
period. Assume that prior to introducing the experimental treatments, the experimenter con- 
firmed that the level of depression in the two groups was equal. After six months elapse, all ten 
subjects are rated by a psychiatrist (who is blind with respect to a subject's experimental 
condition) on their level of depression. The psychiatrist's depression ratings for the five subjects 
in each group follow (the higher the rating, the more depressed a subject): Group 1: 11, 1, 0, 
2, 0; Group 2: 11, 11, 5, 8, 4. Do the data indicate that the antidepressant drug is effective? 


Table 23.3 summarizes the analysis of the data with the van der Waerden normal-scores 
test for k independent samples. The null hypothesis evaluated with the test is: $H_0$: The k = 
2 groups are derived from the same population. The nondirectional alternative hypothesis 
evaluated with the test is: $H_1$: The k = 2 groups are not derived from the same population. If 
a directional alternative hypothesis is employed, it will predict that the two groups are derived 
from different populations, but more specifically, that the magnitude of the scores in Group 2 will 
be larger than the magnitude of the scores in Group 1. 

The computed value for the van der Waerden test statistic is $\chi^2_{vdw} = 3.16$. Since k = 2, 
df = 2 - 1 = 1. Employing Table A4, we determine that for df = 1, the tabled critical two-tailed 
values that are employed to evaluate a nondirectional alternative hypothesis are $\chi^2_{.05} = 3.84$ and 
$\chi^2_{.01} = 6.63$. Since the computed value $\chi^2_{vdw} = 3.16$ is less than $\chi^2_{.05} = 3.84$, the nondirectional 
alternative hypothesis is not supported at the .05 level. The tabled critical one-tailed values that 
are employed for evaluating a directional alternative hypothesis are $\chi^2_{.05} = 2.71$ and $\chi^2_{.01} = 5.10$. 
Since the computed value $\chi^2_{vdw} = 3.16$ is greater than $\chi^2_{.05} = 2.71$, but less than $\chi^2_{.01} = 
5.10$, the directional alternative hypothesis predicting that Group 1 (the group that received the 
drug) will have a lower level of depression is supported at the .05 level, but not at the .01 level. 
This result is consistent with that obtained when the same data are evaluated with the t test for 
two independent samples and the Mann-Whitney U test. In other words, in the case of the 
latter tests, the same directional alternative hypothesis is supported, but only at the .05 level. 
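
When df = 1, the upper-tail probability of a chi-square value can be obtained from the normal distribution, since a chi-square variable with one degree of freedom is the square of a standard normal deviate. The sketch below uses that relationship (through `math.erfc`) to check the decisions stated above; halving the probability for the directional hypothesis assumes the observed difference is in the predicted direction. The function name is hypothetical.

```python
import math


def chi2_upper_tail_df1(x):
    """Proportion of the chi-square distribution (df = 1) falling above x."""
    return math.erfc(math.sqrt(x / 2))


p_nondirectional = chi2_upper_tail_df1(3.16)  # approximately .075
p_directional = p_nondirectional / 2          # approximately .038

print(p_nondirectional > 0.05)        # True: not significant, nondirectional
print(0.01 < p_directional < 0.05)    # True: significant at .05 only, directional
```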
Table 23.3 Summary of Data for Example 23.2 


        Group 1                    Group 2
  $X_1$   $R_1$   $z_1$      $X_2$   $R_2$   $z_2$
   11      9       .91        11      9       .91
    1      3      -.60        11      9       .91
    0      1.5   -1.10         5      6       .11
    2      4      -.35         8      7       .35
    0      1.5   -1.10         4      5      -.11
  $\sum z_1 = -2.24$         $\sum z_2 = 2.17$
  $\bar{z}_1 = -.448$        $\bar{z}_2 = .434$

$$\tilde{s}^2 = \frac{(.91)^2 + (-.60)^2 + \cdots + (.35)^2 + (-.11)^2}{10 - 1} = .6148$$

$$\chi^2_{vdw} = \frac{(5)(-.448)^2 + (5)(.434)^2}{.6148} = 3.16$$
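
The complete analysis of Example 23.2 can be reproduced from the raw ratings. The following is a minimal sketch that chains the steps summarized in Table 23.3: tie-averaged ranking of the pooled scores, conversion of each rank-order into a normal score, Equation 23.1, and Equation 23.2. It assumes that normal-scores are rounded to two decimal places (to mirror the table lookup) and that `NormalDist().inv_cdf` from the Python standard library stands in for Table A1; the function names are hypothetical.

```python
from statistics import NormalDist


def average_ranks(scores):
    """Tie-averaged ranks (1 = lowest score), returned in the original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = ((start + 1) + (end + 1)) / 2
        start = end + 1
    return ranks


def van_der_waerden(groups, decimals=2):
    """Return (variance estimate, chi-square statistic) for a list of groups."""
    pooled = [x for g in groups for x in g]
    n_total = len(pooled)
    z = [round(NormalDist().inv_cdf(r / (n_total + 1)), decimals)
         for r in average_ranks(pooled)]
    variance = sum(v * v for v in z) / (n_total - 1)  # Equation 23.1
    numerator, start = 0.0, 0
    for g in groups:
        group_z = z[start:start + len(g)]
        mean_z = sum(group_z) / len(group_z)
        numerator += len(g) * mean_z ** 2             # numerator of Equation 23.2
        start += len(g)
    return variance, numerator / variance


variance, chi2_vdw = van_der_waerden([[11, 1, 0, 2, 0], [11, 11, 5, 8, 4]])
print(f"{variance:.4f}")  # 0.6148
print(f"{chi2_vdw:.2f}")  # 3.16
```

If the normal-scores are not rounded (e.g., `decimals=6`), the variance and test statistic will differ slightly from the tabled-value results.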


Endnotes 


1. Conover (1980, 1999) and Marascuilo and McSweeney (1977) note that normal-scores tests 
have equal or greater power than their parametric analogs. The latter sources state that the 
asymptotic relative efficiency (which is discussed in Section VII of the Wilcoxon signed- 
ranks test (Test 6)) of a normal-scores test is equal to 1 when the underlying population 
distribution(s) are normal, and often greater than 1 when the underlying population distribution(s) 
are something other than normal. What the latter translates into is that for a given 
level of power, a normal-scores test will require an equal number or even fewer subjects than 
the analogous parametric test in evaluating an alternative hypothesis. 


2. Although it is possible to conduct a directional analysis when k ≥ 3, such an analysis will not 
be described with respect to the van der Waerden normal-scores test for k independent 
samples. A discussion of a directional analysis when k = 2 can be found in Section VIII 
where the van der Waerden test is employed to evaluate the data for Examples 11.1/12.1 
(which are employed to illustrate the t test for two independent samples and the Mann-Whitney 
U test). A discussion of the evaluation of a directional alternative hypothesis when 
k ≥ 3 can be found in Section VII of the chi-square goodness-of-fit test (Test 8). Although 
the latter discussion is in reference to analysis of a k independent samples design involving 
categorical data, the general principles regarding analysis of a directional alternative 
hypothesis when k ≥ 3 are applicable to the van der Waerden normal-scores test for k 
independent samples. 


3. The proportion of cases in the normal distribution that fall below the mean is .5000. The 
value .0948 in Column 2 represents the proportion of cases that fall between the mean and 
the value z = .24. Thus, .5000 + .0948 = .5948 represents the proportion of cases that fall 
below the value z = .24. 


4. Conover (1980, 1999) notes that if there are no ties, the mean of the N $z_{ij}$ scores (i.e., the 
mean of all N normal-scores) will equal zero, and be extremely close to zero when there are 
ties. Thus, if the mean equals zero, the equation $\tilde{s}^2 = \sum (X - \bar{X})^2/(n - 1)$ (which is 
Equation I.5, the definitional equation for computing the unbiased estimate of a population 
variance) reduces to $\tilde{s}^2 = \sum X^2/(n - 1)$. If z is employed in place of X and N in place of 
n (since in Equation 23.1 the variance of N z scores is computed), we obtain Equation 23.1, 
$\tilde{s}^2 = \sum z_{ij}^2/(N - 1)$. 


5. When two or more groups do not have the same sample size, the value of $n_j$ for a given 
group is used to represent the group sample size in any of the equations that require the group 
sample size. 


6. In the discussion of comparisons under the single-factor between-subjects analysis of 
variance, it is noted that a simple (also known as a pairwise) comparison is a comparison 
between any two groups in a set of k groups. 


7. The rationale for the use of the proportions .0167 and .0083 is explained more thoroughly in
Endnote 5 of the Kruskal-Wallis one-way analysis of variance. In the case of the latter
test, since the normal distribution is employed in the comparison procedure, the explanation
of the proportions is in reference to a standard normal deviate (i.e., a z value). The same
rationale applies when the t distribution is employed, with the only difference being that for
a corresponding probability level, the t values that are used are different from the z values
employed for the Kruskal-Wallis comparison procedure.

The values $t_{.0167} = 2.35$ and $t_{.0083} = 2.75$ are based on interpolating the t values in
Table A2 (since exact values are not listed for $t_{.0167}$ and $t_{.0083}$). The value $t_{.0167} = 2.35$ is
the best estimate of the t value at the 98.33rd percentile (since 1 − .9833 = .0167). The value
$t_{.0083} = 2.75$ is the best estimate of the t value at the 99.17th percentile (since 1 − .9917
= .0083).
8. It should be noted that when a directional alternative hypothesis is employed, the sign of the
difference between the two mean normal-scores must be consistent with the prediction stated 
in the directional alternative hypothesis. When a nondirectional alternative hypothesis is 
employed, the direction of the difference between two mean normal-scores is irrelevant. 


9. Equation 23.4 is analogous to Equation 22.6, which Conover (1980, 1999) employs in
conducting a comparison for the Kruskal-Wallis one-way analysis of variance. If the element
$[(N - 1 - \chi^2_{vdW})/(N - k)]$ is omitted from the radical in Equation 23.4, it becomes
Equation 23.5, which is analogous to Equation 22.5. Equation 23.5 will yield a larger $CD_{vdW}$
value than Equation 23.4. It is demonstrated below that when Equation 23.5 is employed
with the data for Example 23.1, the value $CD_{vdW} = 1.504$ is obtained. If the latter
$CD_{vdW}$ value is employed, none of the pairwise comparisons are significant, since no
difference is equal to or greater than 1.504. As noted in the discussion of comparisons in the
Kruskal-Wallis one-way analysis of variance, an equation in the form of Equation 23.5
conducts a less powerful/more conservative comparison.


(Equation 23.5) 


OD e "fi + | = (2.75) LE + 3] = 1.504 


n n, 


In Section VI of the Kruskal-Wallis one-way analysis of variance it is noted that 
sources do not agree on the methodology for conducting pairwise comparisons for the latter 
test. The general comments regarding comparisons for the Kruskal-Wallis test can be ex- 
tended to the van der Waerden normal-scores test for k independent samples (since 
sources do not agree on the comparison protocol for the van der Waerden test). Thus, if 
two or more methods for conducting comparisons yield dramatically different results for a 
specific comparison, one or more replication studies employing reasonably large sample sizes 
should clarify whether or not an obtained difference is reliable, as well as the magnitude of 
the difference (if, in fact, one exists). 


10. The tabled critical one-tailed .05 and .01 values are, respectively, the tabled chi-square values
at the 90th and 98th percentiles/quantiles of the chi-square distribution. For clarification
regarding the latter values, the reader should review the material on the evaluation of a
directional hypothesis involving the chi-square distribution in Section VII of the chi-square
goodness-of-fit test, and the discussion of Table A4 in Section IV of the single-sample
chi-square test for a population variance (Test 3). Since the chi-square value at the 98th
percentile is not in Table A4, the value $\chi^2_{.98} = 5.10$ is an approximation of the latter value.
Inferential Statistical Tests Employed
with Two or More Dependent Samples
(and Related Measures of
Association/Correlation)


Test 24: The Single-Factor Within-Subjects Analysis of Variance

Test 25: The Friedman Two-Way Analysis of Variance by Ranks

Test 26: The Cochran Q Test
Test 24


The Single-Factor Within-Subjects Analysis of Variance 
(Parametric Test Employed with Interval/Ratio Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test In a set of k dependent samples (where k ≥ 2), do at least two
of the samples represent populations with different mean values? 


Relevant background information on test Prior to reading this section the reader should re- 
view the general comments on the analysis of variance in Section I of the single-factor between- 
subjects analysis of variance (Test 21). In addition, the general information regarding a de- 
pendent samples design contained in Sections I and VII of the t test for two dependent samples 
(Test 17) should be reviewed. The single-factor within-subjects analysis of variance (which 
is also referred to as the single-factor repeated-measures analysis of variance and the 
randomized-blocks one-way analysis of variance) is employed in a hypothesis testing 
situation involving k dependent samples. In contrast to the t test for two dependent samples,
which only allows for a comparison of the means of two dependent samples, the single-factor
within-subjects analysis of variance allows for comparison of two or more dependent samples.
When the number of dependent samples is k = 2, the single-factor within-subjects analysis of
variance and the t test for two dependent samples yield equivalent results.
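
The equivalence for k = 2 can be illustrated numerically: the F ratio equals the square of the t value for two dependent samples. The sketch below is not part of the original discussion. It borrows the Condition 1 and Condition 3 scores of Example 24.1 (presented in Section II) and treats them as a two-condition design; the variable names are illustrative.

```python
from math import sqrt

# Conditions 1 and 3 of Example 24.1 (six subjects), treated as a k = 2 design.
cond_a = [9, 10, 7, 10, 7, 8]
cond_b = [4, 7, 3, 7, 2, 6]
n, k = len(cond_a), 2

# t test for two dependent samples, computed from the difference scores.
diffs = [a - b for a, b in zip(cond_a, cond_b)]
mean_d = sum(diffs) / n
var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
t = mean_d / sqrt(var_d / n)

# Within-subjects F ratio computed from the sums of squares.
scores = cond_a + cond_b
correction = sum(scores) ** 2 / (n * k)            # (sum of all scores)^2 / N
ss_total = sum(x ** 2 for x in scores) - correction
ss_bc = (sum(cond_a) ** 2 + sum(cond_b) ** 2) / n - correction
ss_bs = sum((a + b) ** 2 for a, b in zip(cond_a, cond_b)) / k - correction
ss_res = ss_total - ss_bc - ss_bs
f_ratio = (ss_bc / (k - 1)) / (ss_res / ((n - 1) * (k - 1)))

print(round(t ** 2, 2), round(f_ratio, 2))
```

Both printed values agree, which is the sense in which the two tests yield equivalent results when k = 2.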

In conducting the single-factor within-subjects analysis of variance, each of the k sample 
means is employed to estimate the value of the mean of the population the sample represents. 
If the computed test statistic is significant, it indicates there is a significant difference between 
at least two of the sample means in the set of k means. As a result of the latter, the researcher can 
conclude there is a high likelihood that at least two of the samples represent populations with 
different mean values. 

In order to compute the test statistic for the single-factor within-subjects analysis of 
variance, the following two variability components (which are part of the total variability) 
are contrasted with one another: between-conditions variability and residual variability. 
Between-conditions variability (which is also referred to as treatment variability) is essentially 
a measure of the variance of the means of the k experimental conditions. Residual variability is 
the amount of variability within the k scores of each of the n subjects which cannot be accounted 
for on the basis of a treatment effect. Residual variability is viewed as variability that results 
from chance factors that are beyond the control of a researcher, and since chance factors are often 
referred to as experimental error, residual variability is also referred to as error variability. 
The F ratio, which is the test statistic for the single-factor within-subjects analysis of 
variance, is obtained by dividing between-conditions variability by residual variability. Since 
residual variability is employed as a baseline measure of the variability in a set of data that is 
beyond a researcher's control, it is assumed that if the k experimental conditions represent 
populations with the same mean value, the amount of variability between the means of the k 
experimental conditions (i.e., between-conditions variability) will be approximately the same 
value as the residual variability. If, on the other hand, between-conditions variability is
significantly larger than residual variability (in which case the value of the F ratio will be larger
than 1), it is likely that something in addition to chance factors is contributing to the amount of
variability between the means of the experimental conditions. In such a case, it is assumed that
whatever it is that differentiates the experimental conditions from one another (i.e., the
independent variable/experimental treatments) accounts for the fact that between-conditions variability
is larger than residual variability. A thorough discussion of the logic underlying the single-
factor within-subjects analysis of variance can be found in Section VII.

The single-factor within-subjects analysis of variance is employed with interval/ratio data
and is based on the following assumptions: a) The sample of n subjects has been randomly 
selected from the population it represents; b) The distribution of data in the underlying popu- 
lations each of the experimental conditions represents is normal; and c) The third assumption, 
which is referred to as the sphericity assumption, is the analog of the homogeneity of variance 
assumption of the single-factor between-subjects analysis of variance. The assumption of 
sphericity, which is mathematically more complex than the homogeneity of variance assumption, 
essentially revolves around the issue of whether or not the underlying population variances and 
covariances are equal. A full discussion of the sphericity assumption (as well as the concept of 
covariance) can be found in Section VI. It should also be noted that the single-factor within- 
subjects analysis of variance is more sensitive to violations of its assumptions than is the single- 
factor between-subjects analysis of variance. 

As is the case for the t test for two dependent samples, in order for the single-factor 
within-subjects analysis of variance to generate valid results the following guidelines should 
be adhered to: a) In order to control for order effects, the presentation of the k experimental 
conditions should be random or, if appropriate, be counterbalanced; and b) If matched samples 
are employed, within each set of matched subjects each of the subjects should be randomly 
assigned to one of the k experimental conditions. 

As is the case with the t test for dependent samples, when k = 2 the single-factor within-
subjects analysis of variance can be employed to evaluate a before-after design, as well as
extensions of the latter design that involve more than two measurement periods. The limitations
of the before-after design (which are discussed in Section VII of the t test for dependent
samples) are also applicable when it is evaluated with the single-factor within-subjects analysis
of variance.

The reader should take note of the fact that there are certain advantages associated with 
employing a within-subjects design as opposed to a between-subjects design. If within-subjects 
and between-subjects designs that evaluate the same hypothesis and involve the same number 
of scores in each of the experimental conditions are compared with one another, the number of 
subjects required for the within-subjects analysis is a fraction (specifically, 1/k) of the number
required for the between-subjects analysis. Another advantage of a within-subjects analysis is 
that it provides for a more powerful test of an alternative hypothesis. The latter can be attributed 
to the fact that the error variability associated with a within-subjects analysis is less than that 
associated with a between-subjects analysis. In spite of the aforementioned advantages of em- 
ploying a within-subjects design, the between-subjects design is more commonly employed in 
research, since in many experiments it is impractical for a subject to serve in more than one 
experimental condition. 


II. Example 


Example 24.1 A psychologist conducts a study to determine whether or not noise can inhibit 
learning. Each of six subjects is tested under three experimental conditions. In each of the 
experimental conditions a subject is given 20 minutes to memorize a list of 10 nonsense syllables,
which the subject is told she will be tested on the following day. The three experimental 
conditions each subject serves under are as follows: Condition 1, the no noise condition, 
requires subjects to study the list of nonsense syllables in a quiet room. Condition 2, the 
moderate noise condition, requires subjects to study the list of nonsense syllables while listening 
to classical music. Condition 3, the extreme noise condition, requires subjects to study the list 
of nonsense syllables while listening to rock music. Although in each of the experimental 
conditions subjects are presented with a different list of nonsense syllables, the three lists are 
comparable with respect to those variables that are known to influence a person's ability to learn 
nonsense syllables. To control for order effects, the order of presentation of the three 
experimental conditions is completely counterbalanced.* The number of nonsense syllables 
correctly recalled by the six subjects under the three experimental conditions follows. (Subjects' 
scores are listed in the order Condition 1, Condition 2, Condition 3.) Subject 1: 9, 7, 4; 
Subject 2: 10, 8, 7; Subject 3: 7, 5, 3; Subject 4: 10, 8, 7; Subject 5: 7, 5, 2; Subject 6: 8, 
6, 6. Do the data indicate that noise influenced subjects’ performance? 


III. Null versus Alternative Hypotheses 


Null hypothesis     $H_0: \mu_1 = \mu_2 = \mu_3$


(The mean of the population Condition 1 represents equals the mean of the population Condition 
2 represents equals the mean of the population Condition 3 represents.) 


Alternative hypothesis     $H_1$: Not $H_0$


(This indicates that there is a difference between at least two of the k = 3 population means.
It is important to note that the alternative hypothesis should not be written as follows:
$H_1: \mu_1 \neq \mu_2 \neq \mu_3$. The latter notation for the alternative hypothesis is incorrect
because it implies that all three population means must differ from one another in order to
reject the null hypothesis. In this book it will be assumed (unless stated otherwise) that the
alternative hypothesis for the analysis of variance is stated nondirectionally. In order to reject
the null hypothesis, the obtained F value must be equal to or greater than the tabled critical F 
value at the prespecified level of significance.) 


IV. Test Computations 


The test statistic for the single-factor within-subjects analysis of variance can be computed 
with either computational or definitional equations. Although definitional equations reveal the 
underlying logic behind the analysis of variance, they involve considerably more calculations than 
do the computational equations. Because of the latter, computational equations will be employed 
in this section to demonstrate the computation of the test statistic. The definitional equations for 
the single-factor within-subjects analysis of variance are described in Section VII. 

The data for Example 24.1 are summarized in Table 24.1. In Table 24.1 the k = 3 scores
of the n = 6 subjects in Conditions 1, 2, and 3 are, respectively, listed in the columns labelled
$X_1$, $X_2$, and $X_3$. The notation n is employed to represent the number of scores in each of the
experimental conditions. Since there are n = 6 scores in each condition, $n = n_1 = n_2 = n_3 = 6$.
The columns labelled $X_1^2$, $X_2^2$, and $X_3^2$ list the squares of the scores of the six subjects in each
of the three experimental conditions. The last column, labelled $\Sigma S_i$, lists for each of the six
subjects the sum of a subject's k = 3 scores. Thus, the value $\Sigma S_i$ is the sum of the scores for
Subject i under Conditions 1, 2, and 3.
Table 24.1 Data for Example 24.1

              Condition 1         Condition 2         Condition 3
              $X_1$    $X_1^2$      $X_2$    $X_2^2$      $X_3$    $X_3^2$      $\Sigma S_i$
Subject 1       9       81          7       49          4       16          20
Subject 2      10      100          8       64          7       49          25
Subject 3       7       49          5       25          3        9          15
Subject 4      10      100          8       64          7       49          25
Subject 5       7       49          5       25          2        4          14
Subject 6       8       64          6       36          6       36          20

Sums:   $\Sigma X_1 = 51$, $\Sigma X_1^2 = 443$;  $\Sigma X_2 = 39$, $\Sigma X_2^2 = 263$;  $\Sigma X_3 = 29$, $\Sigma X_3^2 = 163$;  $\Sigma X_T = 119$

Means:  $\bar{X}_1 = \Sigma X_1/n_1 = 51/6 = 8.5$;  $\bar{X}_2 = \Sigma X_2/n_2 = 39/6 = 6.5$;  $\bar{X}_3 = \Sigma X_3/n_3 = 29/6 = 4.83$


The notation N represents the total number of scores in the experiment. Since there are 
n = 6 subjects and each subject has k = 3 scores, there are a total of nk = N = (6)(3) = 18 scores. 
The value $\Sigma X_T$ represents the sum of the N = 18 scores (i.e., the total sum of scores). Thus:

$$\Sigma X_T = \Sigma X_1 + \Sigma X_2 + \cdots + \Sigma X_k$$

Since there are k = 3 experimental conditions, $\Sigma X_T = 119$.

$$\Sigma X_T = \Sigma X_1 + \Sigma X_2 + \Sigma X_3 = 51 + 39 + 29 = 119$$

The value $\Sigma X_T$ can also be computed by adding up the $\Sigma S_i$ scores computed for the n
subjects. Thus:

$$\Sigma X_T = \Sigma S_1 + \Sigma S_2 + \cdots + \Sigma S_n$$

$$\Sigma X_T = \Sigma S_1 + \Sigma S_2 + \Sigma S_3 + \Sigma S_4 + \Sigma S_5 + \Sigma S_6 = 20 + 25 + 15 + 25 + 14 + 20 = 119$$

$\bar{X}_T$ represents the grand mean, where $\bar{X}_T = \Sigma X_T/N$. Thus, $\bar{X}_T = 119/18 = 6.61$. Although
$\bar{X}_T$ is not employed in the computational equations to be described in this section, it is employed
in some of the definitional equations described in Section VII.

The value $\Sigma X_T^2$ represents the total sum of the N squared scores. Thus:

$$\Sigma X_T^2 = \Sigma X_1^2 + \Sigma X_2^2 + \cdots + \Sigma X_k^2$$

Since there are k = 3 experimental conditions, $\Sigma X_T^2 = 869$.

$$\Sigma X_T^2 = \Sigma X_1^2 + \Sigma X_2^2 + \Sigma X_3^2 = 443 + 263 + 163 = 869$$
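
The summary values described above can be reproduced from the raw scores of Example 24.1. The following minimal sketch is offered only as a check on the hand calculations; the variable names are illustrative and are not part of the notation used in this book.

```python
# Scores from Example 24.1: one row per subject, columns are Conditions 1-3.
scores = [
    [9, 7, 4],
    [10, 8, 7],
    [7, 5, 3],
    [10, 8, 7],
    [7, 5, 2],
    [8, 6, 6],
]
n, k = len(scores), len(scores[0])   # n = 6 subjects, k = 3 conditions
N = n * k                            # total number of scores

# Column sums, column sums of squares, and subject (row) sums.
cond_sums = [sum(row[j] for row in scores) for j in range(k)]
cond_sq_sums = [sum(row[j] ** 2 for row in scores) for j in range(k)]
subject_sums = [sum(row) for row in scores]

total_sum = sum(cond_sums)           # total sum of the N scores
total_sq_sum = sum(cond_sq_sums)     # total sum of the N squared scores
grand_mean = total_sum / N
cond_means = [s / n for s in cond_sums]

print(cond_sums, cond_sq_sums, subject_sums)
print(total_sum, total_sq_sum, round(grand_mean, 2))
```

Note that the total sum of scores is the same whether it is accumulated by condition or by subject, which is the check described in the text.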


Although the means for each of the experimental conditions are not required for computing 
the analysis of variance test statistic, it is recommended that they be computed since visual 
inspection of the condition means can provide the researcher with a general idea of whether or 
not it is reasonable to expect a significant result. To be more specific, if two or more of the 
condition means are far removed from one another, it is likely that the analysis of variance will 
be significant (especially if there are a relatively large number of subjects employed in the
experiment). Another reason for computing the condition means is that they are required for 
comparing individual conditions with one another, which is something that is often done 
following the analysis of variance on the full set of data. The latter types of comparisons are 
described in Section VI. 

In order to compute the test statistic for the single-factor within-subjects analysis of 
variance, the total variability in the data is divided into a number of different components. 
Specifically, the following variability components are computed: a) The total sum of squares,
which is represented by the notation $SS_T$; b) The between-conditions sum of squares, which is
represented by the notation $SS_{BC}$. The between-conditions sum of squares is the numerator of
the equation that represents between-conditions variability (i.e., the equation that represents the
amount of variability between the means of the k conditions); c) The between-subjects sum of
squares, which is represented by the notation $SS_{BS}$. The between-subjects sum of squares is the
numerator of the equation that represents between-subjects variability, which is the amount of
variability between the mean scores of the n subjects (the mean of each subject being the average
of a subject's k scores); and d) The residual sum of squares, which is represented by the notation $SS_{res}$.
The residual sum of squares is the numerator of the equation that represents residual
variability (i.e., error variability that is beyond the researcher's control).





Equation 24.1 describes the relationship between $SS_T$, $SS_{BC}$, $SS_{BS}$, and $SS_{res}$.

$$SS_T = SS_{BC} + SS_{BS} + SS_{res} \qquad \text{(Equation 24.1)}$$

Equation 24.2 is employed to compute $SS_T$.

$$SS_T = \Sigma X_T^2 - \frac{(\Sigma X_T)^2}{N} \qquad \text{(Equation 24.2)}$$

Employing Equation 24.2, the value $SS_T = 82.28$ is computed.

$$SS_T = 869 - \frac{(119)^2}{18} = 869 - 786.72 = 82.28$$


Equation 24.3 is employed to compute $SS_{BC}$. In Equation 24.3 the notation $\Sigma X_j$ represents
the sum of the n scores in the j th condition. Note that in Equation 24.3 the notation $n_j$ can be
employed in place of n, since there are $n = n_j$ scores in the j th condition.

$$SS_{BC} = \sum_{j=1}^{k}\left[\frac{(\Sigma X_j)^2}{n}\right] - \frac{(\Sigma X_T)^2}{N} \qquad \text{(Equation 24.3)}$$

The notation $\sum_{j=1}^{k}[(\Sigma X_j)^2/n]$ in Equation 24.3 indicates that for each condition the value $(\Sigma X_j)^2/n$
is computed, and the latter values are summed for all k conditions.
With reference to Example 24.1, Equation 24.3 can be rewritten as follows:

$$SS_{BC} = \left[\frac{(\Sigma X_1)^2}{n} + \frac{(\Sigma X_2)^2}{n} + \frac{(\Sigma X_3)^2}{n}\right] - \frac{(\Sigma X_T)^2}{N}$$

Substituting the appropriate values from Example 24.1 in Equation 24.3, the value
$SS_{BC} = 40.45$ is computed.

$$SS_{BC} = \left[\frac{(51)^2}{6} + \frac{(39)^2}{6} + \frac{(29)^2}{6}\right] - \frac{(119)^2}{18} = 827.17 - 786.72 = 40.45$$
Equation 24.4 is employed to compute $SS_{BS}$.

$$SS_{BS} = \sum_{i=1}^{n}\left[\frac{(\Sigma S_i)^2}{k}\right] - \frac{(\Sigma X_T)^2}{N} \qquad \text{(Equation 24.4)}$$

The notation $\sum_{i=1}^{n}[(\Sigma S_i)^2/k]$ in Equation 24.4 indicates that for each subject the value
$(\Sigma S_i)^2/k$ is computed, and the latter values are summed for all n subjects. With reference to
Example 24.1, Equation 24.4 can be rewritten as follows:


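$$SS_{BS} = \left[\frac{(\Sigma S_1)^2}{k} + \frac{(\Sigma S_2)^2}{k} + \frac{(\Sigma S_3)^2}{k} + \frac{(\Sigma S_4)^2}{k} + \frac{(\Sigma S_5)^2}{k} + \frac{(\Sigma S_6)^2}{k}\right] - \frac{(\Sigma X_T)^2}{N}$$

Substituting the appropriate values from Example 24.1 in Equation 24.4, the value
$SS_{BS} = 36.93$ is computed.

$$SS_{BS} = \left[\frac{(20)^2}{3} + \frac{(25)^2}{3} + \frac{(15)^2}{3} + \frac{(25)^2}{3} + \frac{(14)^2}{3} + \frac{(20)^2}{3}\right] - \frac{(119)^2}{18} = 823.65 - 786.72 = 36.93$$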


By algebraically transposing the terms in Equation 24.1, the value of $SS_{res}$ can be computed
with Equation 24.5.

$$SS_{res} = SS_T - SS_{BC} - SS_{BS} \qquad \text{(Equation 24.5)}$$

Employing Equation 24.5, the value $SS_{res} = 4.9$ is computed.

$$SS_{res} = 82.28 - 40.45 - 36.93 = 4.9$$


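Equation 24.6 is a computationally more complex equation for computing the value of $SS_{res}$.

$$SS_{res} = \Sigma X_T^2 - \sum_{j=1}^{k}\left[\frac{(\Sigma X_j)^2}{n}\right] - \sum_{i=1}^{n}\left[\frac{(\Sigma S_i)^2}{k}\right] + \frac{(\Sigma X_T)^2}{N} \qquad \text{(Equation 24.6)}$$

Since $\Sigma X_T^2 = 869$, $\sum_{j=1}^{k}[(\Sigma X_j)^2/n] = 827.17$, $\sum_{i=1}^{n}[(\Sigma S_i)^2/k] = 823.65$, and $(\Sigma X_T)^2/N
= 786.72$, employing Equation 24.6, the value $SS_{res} = 4.9$ is computed.

$$SS_{res} = 869 - 827.17 - 823.65 + 786.72 = 4.9$$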
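
The sums of squares defined by Equations 24.2 through 24.6 can be verified with a short script. This is a sketch for checking purposes only, with illustrative variable names. Because it carries full precision instead of rounding each intermediate term, its results differ from the hand-computed values in the second decimal place (e.g., 36.94 instead of 36.93 for the between-subjects sum of squares, and 4.89 instead of 4.9 for the residual sum of squares).

```python
# Scores from Example 24.1: one row per subject, columns are Conditions 1-3.
scores = [
    [9, 7, 4],
    [10, 8, 7],
    [7, 5, 3],
    [10, 8, 7],
    [7, 5, 2],
    [8, 6, 6],
]
n, k = len(scores), len(scores[0])
N = n * k

total_sum = sum(sum(row) for row in scores)
total_sq_sum = sum(x ** 2 for row in scores for x in row)
correction = total_sum ** 2 / N                      # (sum of all scores)^2 / N

# Sum over conditions of (condition sum)^2 / n.
bc_term = sum(sum(row[j] for row in scores) ** 2 / n for j in range(k))
# Sum over subjects of (subject sum)^2 / k.
bs_term = sum(sum(row) ** 2 / k for row in scores)

ss_total = total_sq_sum - correction                 # Equation 24.2
ss_bc = bc_term - correction                         # Equation 24.3
ss_bs = bs_term - correction                         # Equation 24.4
ss_res = ss_total - ss_bc - ss_bs                    # Equation 24.5
ss_res_alt = total_sq_sum - bc_term - bs_term + correction   # Equation 24.6

print(round(ss_total, 2), round(ss_bc, 2), round(ss_bs, 2), round(ss_res, 2))
```

Equations 24.5 and 24.6 are algebraically identical, so `ss_res` and `ss_res_alt` agree up to floating-point error.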


The reader should take note of the fact that the values $SS_T$, $SS_{BC}$, $SS_{BS}$, and $SS_{res}$ must
always be positive numbers. If a negative value is obtained for any of the aforementioned values,
it indicates a computational error has been made.

At this point the values of the between-conditions variance, between-subjects variance,
and the residual variance can be computed. In the single-factor within-subjects analysis
of variance, the between-conditions variance is referred to as the mean square between-
conditions, which is represented by the notation $MS_{BC}$. $MS_{BC}$ is computed with Equation 24.7.

$$MS_{BC} = \frac{SS_{BC}}{df_{BC}} \qquad \text{(Equation 24.7)}$$

The between-subjects variance is referred to as the mean square between-subjects,
which is represented by the notation $MS_{BS}$. $MS_{BS}$ is computed with Equation 24.8.

$$MS_{BS} = \frac{SS_{BS}}{df_{BS}} \qquad \text{(Equation 24.8)}$$

The residual variance is referred to as the mean square residual, which is represented by
the notation $MS_{res}$. $MS_{res}$ is computed with Equation 24.9.

$$MS_{res} = \frac{SS_{res}}{df_{res}} \qquad \text{(Equation 24.9)}$$





Note that a total mean square is not computed. 

In order to compute $MS_{BC}$, $MS_{BS}$, and $MS_{res}$, it is required that the values $df_{BC}$, $df_{BS}$, and
$df_{res}$ (the denominators of Equations 24.7-24.9) be computed. $df_{BC}$, which represents the
between-conditions degrees of freedom, is computed with Equation 24.10.

$$df_{BC} = k - 1 \qquad \text{(Equation 24.10)}$$

$df_{BS}$, which represents the between-subjects degrees of freedom, is computed with
Equation 24.11.

$$df_{BS} = n - 1 \qquad \text{(Equation 24.11)}$$

$df_{res}$, which represents the residual degrees of freedom, is computed with Equation
24.12.

$$df_{res} = (n - 1)(k - 1) \qquad \text{(Equation 24.12)}$$

Although it is not required in order to determine the F ratio, the total degrees of freedom
value is generally computed, since it can be used to confirm the df values computed with Equations
24.10-24.12, and since it is employed in the analysis of variance summary table. The
total degrees of freedom (represented by the notation $df_T$) are computed with Equation 24.13.

$$df_T = nk - 1 = N - 1 \qquad \text{(Equation 24.13)}$$

The relationship between $df_{BC}$, $df_{BS}$, $df_{res}$, and $df_T$ is described by Equation 24.14.

$$df_T = df_{BC} + df_{BS} + df_{res} \qquad \text{(Equation 24.14)}$$

Employing Equations 24.10-24.13, the values $df_{BC} = 2$, $df_{BS} = 5$, $df_{res} = 10$, and
$df_T = 17$ are computed. Note that $df_T = df_{BC} + df_{BS} + df_{res} = 2 + 5 + 10 = 17$.
$$df_{BC} = 3 - 1 = 2 \qquad df_{BS} = 6 - 1 = 5$$

$$df_{res} = (6 - 1)(3 - 1) = 10 \qquad df_T = 18 - 1 = 17$$

Employing Equations 24.7-24.9, the values $MS_{BC} = 20.23$, $MS_{BS} = 7.39$, and $MS_{res} = .49$
are computed.

$$MS_{BC} = \frac{40.45}{2} = 20.23 \qquad MS_{BS} = \frac{36.93}{5} = 7.39 \qquad MS_{res} = \frac{4.9}{10} = .49$$


The F ratio, which is the test statistic for the single-factor within-subjects analysis of
variance, is computed with Equation 24.15.

$$F = \frac{MS_{BC}}{MS_{res}} \qquad \text{(Equation 24.15)}$$

Employing Equation 24.15, the value F = 41.29 is computed.

$$F = \frac{20.23}{.49} = 41.29$$


The reader should take note of the fact that the values $MS_{BC}$, $MS_{BS}$, and $MS_{res}$ must always
be positive numbers. If a negative value is obtained for any of the aforementioned values, it
indicates a computational error has been made. If $MS_{res} = 0$, Equation 24.15 will be insoluble.
If all of the conditions have the identical mean value, $MS_{BC} = 0$, and if the latter is true, F = 0.
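
The steps from the sums of squares to the F ratio (Equations 24.7 through 24.15) can be sketched as follows. The sketch starts from the rounded sums of squares reported in this section, so its results agree with the hand calculations only to within rounding; the variable names are illustrative.

```python
# Rounded sums of squares and design sizes taken from Example 24.1.
ss_bc, ss_bs, ss_res = 40.45, 36.93, 4.9
n, k = 6, 3

# Degrees of freedom (Equations 24.10-24.13).
df_bc = k - 1
df_bs = n - 1
df_res = (n - 1) * (k - 1)
df_total = n * k - 1
assert df_total == df_bc + df_bs + df_res   # Equation 24.14

# Mean squares (Equations 24.7-24.9).
ms_bc = ss_bc / df_bc
ms_bs = ss_bs / df_bs
ms_res = ss_res / df_res

# F ratio (Equation 24.15); undefined when the residual mean square is zero.
if ms_res == 0:
    raise ZeroDivisionError("MS_res = 0, so the F ratio cannot be computed")
f_ratio = ms_bc / ms_res

print(df_bc, df_bs, df_res, df_total)
print(round(ms_bc, 3), round(ms_bs, 3), round(ms_res, 3), round(f_ratio, 2))
```

The printed F ratio differs slightly from 41.29 because the text rounds $MS_{BC}$ to 20.23 before dividing.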


V. Interpretation of the Test Results 


It is common practice to summarize the results of a single-factor within-subjects analysis of 
variance with the summary table represented by Table 24.2. 


Table 24.2 Summary Table of Analysis of Variance for Example 24.1

Source of variation       SS       df      MS        F
Between-subjects         36.93      5      7.39
Between-conditions       40.45      2     20.23     41.29
Residual                  4.90     10       .49
Total                    82.28     17


The obtained value F = 41.29 is evaluated with Table A10 (Table of the F Distribution)
in the Appendix. In Table A10 critical values are listed in reference to the number of degrees
of freedom associated with the numerator and the denominator of the F ratio (i.e., $df_{num}$ and
$df_{den}$). In employing the F distribution in reference to Example 24.1, the degrees of freedom for
the numerator are $df_{BC} = 2$ and the degrees of freedom for the denominator are $df_{res} = 10$. In
Table A10 the tabled $F_{.95}$ and $F_{.99}$ values are, respectively, employed to evaluate the
nondirectional alternative hypothesis $H_1$: Not $H_0$ at the .05 and .01 levels. As is the case for the
single-factor between-subjects analysis of variance, the notation $F_{.05}$ is employed to represent
the tabled critical F value at the .05 level. The latter value corresponds to the relevant tabled
$F_{.95}$ value in Table A10. In the same respect, the notation $F_{.01}$ is employed to represent the
tabled critical F value at the .01 level, and corresponds to the relevant tabled $F_{.99}$ value in
Table A10.

For $df_{num} = 2$ and $df_{den} = 10$, the tabled $F_{.95}$ and $F_{.99}$ values are $F_{.95} = 4.10$ and
$F_{.99} = 7.56$. Thus, $F_{.05} = 4.10$ and $F_{.01} = 7.56$. In order to reject the null hypothesis, the
obtained F value must be equal to or greater than the tabled critical value at the prespecified level
of significance. Since F = 41.29 is greater than both $F_{.05} = 4.10$ and $F_{.01} = 7.56$, the
alternative hypothesis is supported at both the .05 and .01 levels.

A summary of the analysis of Example 24.1 with the single-factor within-subjects 
analysis of variance follows: It can be concluded that there is a significant difference between 
at least two of the three experimental conditions (i.e., different levels of noise). This result can 
be summarized as follows: F(2,10) = 41.29, p < .01. 
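
The tabled critical values used above can be checked without a table in this particular case. When the numerator degrees of freedom equal 2, the upper-tail probability of the F distribution has a simple closed form, P(F ≥ x) = (1 + 2x/df_den)^(−df_den/2). The sketch below relies on that special case only; it is not a general substitute for Table A10, and the function names are hypothetical.

```python
def f_upper_tail_num2(f_value, df_den):
    """Upper-tail probability P(F >= f_value) for an F distribution with
    2 numerator degrees of freedom (closed form valid only for df_num = 2)."""
    return (1 + 2 * f_value / df_den) ** (-df_den / 2)


def f_critical_num2(alpha, df_den):
    """Critical F value at level alpha, obtained by inverting the closed form
    above (again valid only for df_num = 2)."""
    return (df_den / 2) * (alpha ** (-2 / df_den) - 1)


df_den = 10
f_05 = f_critical_num2(0.05, df_den)
f_01 = f_critical_num2(0.01, df_den)
p_value = f_upper_tail_num2(41.29, df_den)

print(round(f_05, 2), round(f_01, 2), p_value < 0.01)
```

For comparisons with one numerator degree of freedom, or for any other design, a statistical table or library would be needed instead.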


VI. Additional Analytical Procedures for the Single-Factor 
Within-Subjects Analysis of Variance and/or Related Tests 


1. Comparisons following computation of the omnibus F value for the single-factor within- 
subjects analysis of variance Prior to reading this section the reader should review the dis- 
cussion of comparison procedures in Section VI of the single-factor between-subjects analysis 
of variance. As is the case with the latter test, the omnibus F value computed for a single-factor 
within-subjects analysis of variance is based on a comparison of the means of all k 
experimental conditions. Thus, in order to reject the null hypothesis, it is only required that the 
means of at least two of the k conditions differ significantly from one another.’ 

The same procedures that are employed for conducting comparisons for the single-factor 
between-subjects analysis of variance can be used for the single-factor within-subjects 
analysis of variance. Thus, the following comparison procedures discussed under the latter test 
can be employed for conducting comparisons within the framework of the single-factor within-
subjects analysis of variance: Test 24a: Multiple t tests/Fisher's LSD test (which is
equivalent to linear contrasts); Test 24b: The Bonferroni-Dunn test; Test 24c: Tukey's HSD test;
Test 24d: The Newman-Keuls test; Test 24e: The Scheffé test; Test 24f: The Dunnett test.
The only difference in applying the aforementioned comparison procedures to the analysis of
variance under discussion is that a different measure of error variability is employed. Recollect
that in the case of the single-factor between-subjects analysis of variance, a pooled measure of
within-groups variability ($MS_{WG}$) is employed as the measure of experimental error. In the case
of the single-factor within-subjects analysis of variance, $MS_{res}$ is employed as the measure of
error variability. Consequently, in conducting comparisons for a single-factor within-subjects
analysis of variance, $MS_{res}$ is employed in place of $MS_{WG}$ as the error term in the comparison
equations described in Section VI of the single-factor between-subjects analysis of variance. It
should be noted, however, that if the sphericity assumption (which, as noted in Section I, is based
on homogeneity of the underlying population variances and covariances) of the single-factor
within-subjects analysis of variance is violated, $MS_{res}$ may not provide the most accurate
measure of error variability to employ in conducting comparisons. Because of this, an alternative 
measure of error variability (that is not influenced by violation of the sphericity assumption) will 
be presented later in this section. 

At this point, employing $MS_{res}$ as the error term, the following two single degree of
freedom comparisons will be conducted: a) The simple comparison Condition 1 versus 
Condition 2, which is summarized in Table 24.3; and b) The complex comparison Condition 
3 versus the combined performance of Conditions 1 and 2, which is summarized in Table 24.4. 
Table 24.3 Planned Simple Comparison: Condition 1 Versus Condition 2

                                                                    Squared
                          Coefficient      Product                  Coefficient
Condition    $\bar{X}_j$      $(c_j)$          $(c_j)(\bar{X}_j)$              $(c_j^2)$
    1          8.5           +1           (+1)(8.5) = +8.5            1
    2          6.5           -1           (-1)(6.5) = -6.5            1
    3          4.83           0           (0)(4.83) = 0               0

                          $\Sigma c_j = 0$      $\Sigma (c_j)(\bar{X}_j) = 2$         $\Sigma c_j^2 = 2$


Since it will be assumed that the above comparisons are planned prior to collecting the data,
linear contrasts will be conducted in which no attempt is made to control the value of the
familywise Type I error rate ($\alpha_{FW}$). The null hypothesis and nondirectional alternative
hypothesis for the above comparisons are identical to those employed when the analogous
comparisons are conducted in Section VI for the single-factor between-subjects analysis of variance
(i.e., $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$ for the simple comparison, and $H_0: \mu_3 = (\mu_1 + \mu_2)/2$
versus $H_1: \mu_3 \neq (\mu_1 + \mu_2)/2$ for the complex comparison).


Table 24.4 Planned Complex Comparison: Condition 3 Versus Conditions 1 and 2

                                                                    Squared
                          Coefficient      Product                  Coefficient
Condition    $\bar{X}_j$      $(c_j)$          $(c_j)(\bar{X}_j)$              $(c_j^2)$
    1          8.5          -1/2          (-1/2)(8.5) = -4.25         1/4
    2          6.5          -1/2          (-1/2)(6.5) = -3.25         1/4
    3          4.83          +1           (+1)(4.83) = +4.83          1

                          $\Sigma c_j = 0$      $\Sigma (c_j)(\bar{X}_j) = -2.67$     $\Sigma c_j^2 = 1.5$


Equations 21.17 and 21.18, which are employed to compute $SS_{comp}$ and $MS_{comp}$ for linear
contrasts for the single-factor between-subjects analysis of variance, are also employed for
linear contrasts for the single-factor within-subjects analysis of variance. The latter equations
are employed below for the simple comparison Condition 1 versus Condition 2.

$$SS_{comp} = \frac{n\left(\sum c_j \bar{X}_j\right)^2}{\sum c_j^2} = \frac{(6)(2)^2}{2} = 12$$

$$MS_{comp} = \frac{SS_{comp}}{df_{comp}} = \frac{12}{1} = 12$$


Equation 24.16 (which is identical to Equation 21.19, except for the fact that it employs $MS_{res}$
as the error term) is used to compute the value of $F_{comp}$.

$$F_{comp} = \frac{MS_{comp}}{MS_{res}} = \frac{12}{.49} = 24.49 \qquad \text{(Equation 24.16)}$$

The degrees of freedom employed in evaluating the obtained value $F_{comp} = 24.49$ are
$df_{num} = df_{comp} = 1$ (since in a single degree of freedom comparison $df_{comp}$ will always equal
1) and $df_{den} = df_{res} = 10$. For $df_{num} = 1$ and $df_{den} = 10$, the tabled critical .05 and .01
F values in Table A10 are $F_{.05} = 4.96$ and $F_{.01} = 10.04$. Since the obtained value
$F_{comp} = 24.49$ is greater than both of the aforementioned critical values, the nondirectional
alternative hypothesis $H_1: \mu_1 \neq \mu_2$ is supported at both the .05 and .01 levels.

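Applying Equations 21.17, 21.18, and 24.16 to the complex comparison Condition 3 versus
Conditions 1 and 2, the following result is obtained.

$$SS_{comp} = \frac{n\left(\sum c_j \bar{X}_j\right)^2}{\sum c_j^2} = \frac{(6)(-2.67)^2}{1.5} = 28.52$$

$$MS_{comp} = \frac{SS_{comp}}{df_{comp}} = \frac{28.52}{1} = 28.52$$

$$F_{comp} = \frac{MS_{comp}}{MS_{res}} = \frac{28.52}{.49} = 58.20$$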


As is the case for the simple comparison, the degrees of freedom employed for the complex
comparison in evaluating the value $F_{comp} = 58.20$ are $df_{num} = df_{comp} = 1$ and $df_{den} = df_{res} = 10$.
Since the obtained value $F_{comp} = 58.20$ is greater than the tabled critical values $F_{.05} = 4.96$ and
$F_{.01} = 10.04$, the nondirectional alternative hypothesis $H_1: \mu_3 \neq (\mu_1 + \mu_2)/2$ is supported at
both the .05 and .01 levels.
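
The two contrasts summarized in Tables 24.3 and 24.4 can be checked with a few lines of Python. The following is a minimal sketch rather than part of the original analysis: the function name `contrast_f` and its arguments are hypothetical, the condition means are entered rounded to two decimal places as in the tables, and the error term .49 is the residual mean square from the omnibus analysis.

```python
def contrast_f(means, coeffs, n, ms_error):
    """Return (SS_comp, F_comp) for a single degree of freedom linear contrast.

    means    -- condition means
    coeffs   -- contrast coefficients (must sum to zero)
    n        -- number of subjects
    ms_error -- error mean square used as the denominator of F
    """
    if abs(sum(coeffs)) > 1e-9:
        raise ValueError("contrast coefficients must sum to zero")
    weighted_sum = sum(c * m for c, m in zip(coeffs, means))
    ss_comp = n * weighted_sum ** 2 / sum(c ** 2 for c in coeffs)
    ms_comp = ss_comp / 1  # df_comp is always 1 for a single df comparison
    return ss_comp, ms_comp / ms_error


# Condition means for Example 24.1, rounded as in Tables 24.3 and 24.4
means = [8.5, 6.5, 4.83]

ss_simple, f_simple = contrast_f(means, [1, -1, 0], n=6, ms_error=0.49)
ss_complex, f_complex = contrast_f(means, [-0.5, -0.5, 1], n=6, ms_error=0.49)

print(round(ss_simple, 2), round(f_simple, 2))
print(round(ss_complex, 2), round(f_complex, 2))
```

Because the mean of Condition 3 is entered as the rounded value 4.83, the complex comparison agrees with the hand computation above; carrying the unrounded mean 29/6 would give a slightly smaller F value.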

For both of the comparisons that have been conducted, a CD value can be computed that
identifies a minimum required difference in order for two means (or sets of means) to differ from
one another at a prespecified level of significance. For both the simple comparison Condition
1 versus Condition 2 and the complex comparison Condition 3 versus Conditions 1 and 2, a CD
value will be computed employing multiple t tests/Fisher's LSD test (which is equivalent to a
linear contrast in which the value of $\alpha_{FW}$ is not controlled) and the Scheffé test. Whereas
multiple t tests/Fisher's LSD test allow a researcher to determine the minimum CD value
(designated $CD_{LSD}$) that can be computed through use of any of the comparison procedures
that are described for the single-factor between-subjects analysis of variance, the Scheffé test
generally results in the maximum CD value (designated $CD_S$) that can be computed by the
available procedures. Thus, if at a prespecified level of significance the obtained difference for
a comparison is equal to or greater than $CD_S$, it will be significant regardless of which
comparison procedure is employed. On the other hand, if the obtained difference for a comparison
is less than $CD_{LSD}$, it will not achieve significance, regardless of which comparison procedure
is employed. In illustrating the computation of CD values for both simple and complex
comparisons, it will be assumed that the total number of comparisons conducted is c = 3. It will
also be assumed that when the researcher wants to control the value of the familywise Type I
error rate, the value $\alpha_{FW} = .05$ is employed irrespective of which comparison procedure is
used.

The equations employed for computing the values $CD_{LSD}$ and $CD_S$ are essentially the same
as those used for the single-factor between-subjects analysis of variance. Equations 24.17 and
24.18 are employed below to compute the values of $CD_{LSD}$ and $CD_S$ for the simple comparison
Condition 1 versus Condition 2. Note that Equations 24.17 and 24.18 are identical to Equations
21.24 and 21.32, except for the fact that $MS_{res}$ is employed as the error term in both equations,
and in Equation 24.18 $df_{BC}$ is used in place of $df_{BG}$ in determining the numerator degrees of
freedom. In employing Equation 24.18, the tabled critical value $F_{.05} = 4.10$ is employed since
$df_{num} = df_{BC} = 2$ and $df_{den} = df_{res} = 10$.


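$$CD_{LSD} = \sqrt{F_{(1,\,res)}}\,\sqrt{\frac{2MS_{res}}{n}} = \sqrt{4.96}\,\sqrt{\frac{(2)(.49)}{6}} = .90 \qquad \text{(Equation 24.17)}$$

$$CD_S = \sqrt{(k-1)F_{(df_{BC},\,df_{res})}}\,\sqrt{\frac{2MS_{res}}{n}} = \sqrt{(3-1)(4.10)}\,\sqrt{\frac{(2)(.49)}{6}} = 1.16 \qquad \text{(Equation 24.18)}$$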


Thus, if one employs multiple t tests/Fisher's LSD test (i.e., conducts linear contrasts
with $\alpha_{FW}$ not adjusted), in order to differ significantly at the .05 level, the means of any two
conditions must differ from one another by at least .90 units. If, on the other hand, the Scheffé
test is employed, in order to differ significantly the means of any two conditions must differ from
one another by at least 1.16 units. Since the difference score for the comparison Condition 1
versus Condition 2 equals $\bar{X}_1 - \bar{X}_2 = 2$, the comparison is significant at the .05 level, regardless
of which comparison procedure is employed.

Table 24.5 summarizes the differences between pairs of means involving all of the
experimental conditions. Since the difference scores for all three simple/pairwise comparisons are
greater than $CD_S = 1.16$, all of the comparisons are significant at the .05 level, regardless of
which comparison procedure is employed.










Table 24.5 Differences Between Pairs of Means in Example 24.1

$\bar{X}_1 - \bar{X}_2 = 8.5 - 6.5 = 2$
$\bar{X}_1 - \bar{X}_3 = 8.5 - 4.83 = 3.67$
$\bar{X}_2 - \bar{X}_3 = 6.5 - 4.83 = 1.67$

The complex comparison Condition 3 versus Conditions 1 and 2 is illustrated below
employing Equations 24.19 and 24.20. The latter two equations (which are the generic forms of
Equations 24.17 and 24.18 that can be used for both simple and complex comparisons) are
identical to Equations 21.25 and 21.33 (except for the use of $MS_{res}$ as the error term, and the
use of $df_{BC}$ in place of $df_{BG}$ in Equation 24.20).


$$CD_{LSD} = \sqrt{F_{(1,\,res)}}\,\sqrt{\frac{\left(\sum c_j^2\right)(MS_{res})}{n}} = \sqrt{4.96}\,\sqrt{\frac{(1.5)(.49)}{6}} = .78 \qquad \text{(Equation 24.19)}$$

$$CD_S = \sqrt{(k-1)F_{(df_{BC},\,df_{res})}}\,\sqrt{\frac{\left(\sum c_j^2\right)(MS_{res})}{n}} = \sqrt{(3-1)(4.10)}\,\sqrt{\frac{(1.5)(.49)}{6}} = 1.00 \qquad \text{(Equation 24.20)}$$

Thus, if one employs multiple t tests/Fisher's LSD test (i.e., conducts a linear
contrast with $\alpha_{FW}$ not adjusted), in order to differ significantly the difference between $\bar{X}_3$ and
$(\bar{X}_1 + \bar{X}_2)/2$ must be at least .78 units. If, on the other hand, the Scheffé test is employed, in
order to differ significantly the two sets of means must differ from one another by at least 1.00
unit. Since the obtained difference of 2.67 is greater than $CD_S = 1.00$, the nondirectional
alternative hypothesis $H_1: \mu_3 \neq (\mu_1 + \mu_2)/2$ is supported, regardless of which comparison
procedure is employed.
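
The CD computations for both comparisons follow the same pattern, which can be sketched in Python as shown below. This is an illustrative sketch only: the function name `critical_differences` is hypothetical, and the critical F values (4.96 and 4.10) are passed in by hand from Table A10 rather than computed. For a simple comparison the sum of the squared coefficients equals 2, so the generic equations reduce to Equations 24.17 and 24.18.

```python
import math


def critical_differences(sum_c_sq, ms_error, n, k, f_lsd, f_scheffe):
    """Return (CD_LSD, CD_S) following the generic form of Equations 24.19 and 24.20.

    sum_c_sq  -- sum of the squared contrast coefficients
    ms_error  -- error mean square (MS_res)
    n         -- number of subjects
    k         -- number of conditions in the omnibus analysis
    f_lsd     -- tabled critical F for df = (1, df_res)
    f_scheffe -- tabled critical F for df = (df_BC, df_res)
    """
    common = math.sqrt(sum_c_sq * ms_error / n)
    cd_lsd = math.sqrt(f_lsd) * common
    cd_s = math.sqrt((k - 1) * f_scheffe) * common
    return cd_lsd, cd_s


# Simple comparison (Condition 1 versus Condition 2): coefficients +1, -1, 0
cd_simple = critical_differences(2, 0.49, n=6, k=3, f_lsd=4.96, f_scheffe=4.10)

# Complex comparison (Condition 3 versus Conditions 1 and 2): coefficients -1/2, -1/2, +1
cd_complex = critical_differences(1.5, 0.49, n=6, k=3, f_lsd=4.96, f_scheffe=4.10)

print([round(v, 2) for v in cd_simple])
print([round(v, 2) for v in cd_complex])
```

Passing the critical values as arguments keeps the sketch free of external dependencies; a statistical library could supply them instead of the table.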


The computation of a confidence interval for a comparison The procedure that is described
for computing a confidence interval for a comparison for the single-factor between-subjects
analysis of variance can also be used with the single-factor within-subjects analysis of
variance. Thus, in the case of the latter analysis of variance, a confidence interval for a comparison
is computed by adding to and subtracting the relevant CD value for the comparison from the
obtained difference between the means involved in the comparison. As an example, let us
assume the Scheffé test is employed to compute the value $CD_S = 1.16$ for the comparison
Condition 1 versus Condition 2. To compute the 95% confidence interval, the value 1.16 is
added to and subtracted from 2, which is the absolute value of the difference between the two
means. Thus, $CI_{.95} = 2 \pm 1.16$, which can also be written as $.84 \leq (\mu_1 - \mu_2) \leq 3.16$. In
other words, the researcher can be 95% confident (or the probability is .95) that the mean of the
population represented by Condition 1 is between .84 and 3.16 units larger than the mean of the
population represented by Condition 2.


Alternative methodology for computing $MS_{res}$ for a comparison Earlier in this section it is
noted that if the sphericity assumption underlying the single-factor within-subjects analysis of
variance is violated, $MS_{res}$ may not provide an accurate measure of error variability for a
specific comparison. Because of this, many sources recommend that when the sphericity
assumption is violated a separate measure of error variability be computed for each comparison
that is conducted. The procedure that will be discussed in this section (which is described in
Keppel (1991)) can be employed any time a researcher has reason to believe that the $MS_{res}$ employed
in computing the omnibus F value is not representative of the actual error variability for the
experimental conditions involved in a specific comparison. The procedure (which will be
demonstrated for both a simple and complex comparison) requires that a single-factor
within-subjects analysis of variance be conducted employing only the data for those experimental
conditions involved in a comparison. In the case of a simple comparison, the scores of subjects
in the two comparison conditions are evaluated with the analysis of variance. In the case of a
complex comparison, a weighted score must be computed for each subject for any composite
mean that is a combination of two or more experimental conditions.

With respect to a simple planned comparison, the procedure will be employed to evaluate
the difference between Condition 1 and Condition 3. The reason it will not be used for the
Condition 1 versus Condition 2 comparison is that the error variability associated with the
latter comparison is $MS_{res} = 0$. The reason why $MS_{res} = 0$ for the latter comparison is revealed
by inspection of the scores of the six subjects in Table 24.1. Observe that the score of each of
the six subjects in Condition 1 is two units higher than it is in Condition 2. Anytime all of
the subjects in a within-subjects design involving two treatments obtain identical difference
scores, the measure of error variability will equal zero. Consequently, whenever the value of
$MS_{res} = 0$, the value of $F_{comp}$ will be indeterminate, since the denominator of the F ratio will
equal zero. In such an instance, one can either use the $MS_{res}$ value employed in computing the
omnibus F value, or elect to use the smallest $MS_{res}$ value that can be computed for any of the
other simple comparisons in the set of data.

Table 24.6 summarizes the data for the comparison Condition 1 versus Condition 3. Note
that the values $\sum X_T = 80$ and $\sum X_T^2 = 606$ differ from those computed in Table 24.1, since in
this instance they only include the data for two of the three experimental conditions.


Table 24.6 Data for Comparison of Condition 1 Versus Condition 3

                 Condition 1            Condition 3
             $X_1$     $X_1^2$      $X_3$     $X_3^2$      $\sum S$
Subject 1      9         81           4         16           13
Subject 2     10        100           7         49           17
Subject 3      7         49           3          9           10
Subject 4     10        100           7         49           17
Subject 5      7         49           2          4            9
Subject 6      8         64           6         36           14

$\sum X_1 = 51$   $\sum X_1^2 = 443$   $\sum X_3 = 29$   $\sum X_3^2 = 163$   $\sum S = 80$

$\sum X_T = 51 + 29 = 80$     $\sum X_T^2 = 443 + 163 = 606$


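The sum of squares values that are required for the analysis of variance are computed with
Equations 24.2-24.5. Note that since we are only dealing with two conditions, k = 2, and thus,
N = nk = (6)(2) = 12.

$$SS_T = \sum X_T^2 - \frac{\left(\sum X_T\right)^2}{N} = 606 - \frac{(80)^2}{12} = 72.67$$

$$SS_{BC} = \sum\left[\frac{\left(\sum X_j\right)^2}{n}\right] - \frac{\left(\sum X_T\right)^2}{N} = \left[\frac{(51)^2}{6} + \frac{(29)^2}{6}\right] - \frac{(80)^2}{12} = 40.34$$

$$SS_{BS} = \sum\left[\frac{\left(\sum S_i\right)^2}{k}\right] - \frac{\left(\sum X_T\right)^2}{N} = \left[\frac{(13)^2}{2} + \frac{(17)^2}{2} + \frac{(10)^2}{2} + \frac{(17)^2}{2} + \frac{(9)^2}{2} + \frac{(14)^2}{2}\right] - \frac{(80)^2}{12} = 28.67$$

$$SS_{res} = SS_T - SS_{BC} - SS_{BS} = 72.67 - 40.34 - 28.67 = 3.66$$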


Upon computing the sum of squares values, the appropriate degrees of freedom (employing
Equations 24.10-24.13) and mean square values (employing Equations 24.7-24.9) are computed.
Employing Equation 24.15, the value $F = F_{comp} = 55.26$ is computed for the comparison.
The analysis of variance is summarized in Table 24.7. Note that in computing the degrees of
freedom for the analysis, the values n = 6 and k = 2 are employed.


Table 24.7 Summary Table for Analysis of Variance for
Comparison of Condition 1 Versus Condition 3

Source of variation       SS       df      MS        F
Between-subjects         28.67      5      5.73
Between-conditions       40.34      1     40.34     55.26
Residual                  3.66      5       .73

Total                    72.67     11


Employing $df_{num} = df_{BC} = df_{comp} = 1$ and $df_{den} = df_{res} = 5$, the tabled critical values
employed in Table A10 are $F_{.05} = 6.61$ and $F_{.01} = 16.26$. Since the obtained value F = 55.26
is greater than both of the aforementioned critical values, the nondirectional alternative
hypothesis $H_1: \mu_1 \neq \mu_3$ is supported at both the .05 and .01 levels. The reader should note that
the value $df_{res} = 5$ employed for the comparison of Condition 1 versus Condition 3 is less than
the value $df_{res} = 10$ employed when the same comparison was conducted earlier in this section
employing the value $MS_{res} = .49$ obtained for the omnibus F test. The lower $df_{res}$ value
associated with the method under discussion results in a less powerful test of the alternative
hypothesis. In some instances the loss of power associated with this method may be offset if the
value of $MS_{res}$ for the comparison conditions is smaller than the value of $MS_{res}$ obtained for the
omnibus F test.
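
The analysis restricted to the comparison conditions can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `within_subjects_anova` and the dictionary keys are hypothetical, and the computational formulas are the standard ones for a single-factor within-subjects design. The scores for Conditions 1 and 3 are taken from Table 24.6, and the Condition 2 scores are each two units below the Condition 1 scores, as noted above.

```python
def within_subjects_anova(conditions):
    """Single-factor within-subjects ANOVA.

    conditions -- list of k lists; conditions[j][i] is subject i's score in condition j.
    Returns sums of squares, mean squares, and the F ratio. F is None when
    MS_res is zero, since the ratio is then indeterminate.
    """
    k, n = len(conditions), len(conditions[0])
    scores = [x for cond in conditions for x in cond]
    correction = sum(scores) ** 2 / (n * k)                  # (sum of X_T)^2 / N
    ss_total = sum(x ** 2 for x in scores) - correction
    ss_bc = sum(sum(cond) ** 2 for cond in conditions) / n - correction
    ss_bs = sum(sum(row) ** 2 for row in zip(*conditions)) / k - correction
    ss_res = ss_total - ss_bc - ss_bs
    df_bc, df_res = k - 1, (n - 1) * (k - 1)
    ms_bc, ms_res = ss_bc / df_bc, ss_res / df_res
    f_ratio = ms_bc / ms_res if ms_res > 1e-12 else None
    return {"ss_total": ss_total, "ss_bc": ss_bc, "ss_bs": ss_bs,
            "ss_res": ss_res, "ms_bc": ms_bc, "ms_res": ms_res, "F": f_ratio}


cond1 = [9, 10, 7, 10, 7, 8]
cond2 = [7, 8, 5, 8, 5, 6]
cond3 = [4, 7, 3, 7, 2, 6]

result_13 = within_subjects_anova([cond1, cond3])   # Condition 1 versus Condition 3
result_12 = within_subjects_anova([cond1, cond2])   # identical difference scores

print({name: round(value, 2) for name, value in result_13.items()})
print(result_12["ms_res"], result_12["F"])
```

Carried at full precision, the F ratio for Condition 1 versus Condition 3 comes out at about 55.0 rather than 55.26; the small discrepancy reflects the rounding of the residual mean square to .73 in the hand computation. The second call illustrates the indeterminate case in which the residual mean square equals zero.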

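Employing Equations 24.17 and 24.18, a $CD_{LSD}$ and $CD_S$ value can be computed for the
Condition 1 versus Condition 3 comparison.

$$CD_{LSD} = \sqrt{F_{(1,\,res)}}\,\sqrt{\frac{2MS_{res}}{n}} = \sqrt{6.61}\,\sqrt{\frac{(2)(.73)}{6}} = 1.27$$

$$CD_S = \sqrt{(k-1)F_{(df_{BC},\,df_{res})}}\,\sqrt{\frac{2MS_{res}}{n}} = \sqrt{(3-1)(4.10)}\,\sqrt{\frac{(2)(.73)}{6}} = 1.41$$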


In computing $CD_{LSD} = 1.27$, the tabled critical F value employed in Equation 24.17 is
based on $df_{num} = df_{BC} = df_{comp} = 1$, and $df_{den} = df_{res} = 5$. Note that $df_{den} = df_{res} = 5$ is only
based on the data for the two experimental conditions employed in the comparison. In computing
$CD_S = 1.41$, however, the tabled critical F value employed in Equation 24.18 is based on all
three conditions employed in the experiment, and thus $df_{BC} = 2$ and $df_{res} = 10$. Note that the
values $CD_{LSD} = 1.27$ and $CD_S = 1.41$ are larger than the corresponding values $CD_{LSD} = .90$
and $CD_S = 1.16$, which are computed for simple comparisons when $MS_{res} = .49$ is employed.
It should be obvious through inspection of Equations 24.17 and 24.18 that the larger the value
of $MS_{res}$, the greater the magnitude of the computed CD value.

At the beginning of this section it is noted that when a complex comparison is conducted 
employing a residual variability which is based only on the specific conditions involved in the 
comparison, a weighted score must be computed for each subject for any composite mean that 
is a combination of two or more experimental conditions. This will now be illustrated for the 
comparison of Condition 3 versus the combined performance of Conditions 1 and 2. 










Table 24.8 Data for Comparison of Condition 3 Versus Conditions 1 and 2

                 Condition 3            Conditions 1 & 2
             $X_3$     $X_3^2$      $X_{1/2}$   $X_{1/2}^2$    $\sum S$
Subject 1      4         16            8           64            12
Subject 2      7         49            9           81            16
Subject 3      3          9            6           36             9
Subject 4      7         49            9           81            16
Subject 5      2          4            6           36             8
Subject 6      6         36            7           49            13

$\sum X_3 = 29$   $\sum X_3^2 = 163$   $\sum X_{1/2} = 45$   $\sum X_{1/2}^2 = 347$   $\sum S = 74$

$\sum X_T = 29 + 45 = 74$     $\sum X_T^2 = 163 + 347 = 510$

Table 24.8 contains the scores of the six subjects in Condition 3 and a weighted score for
each subject for Conditions 1 and 2. The weighted score of a subject for a combination of two
or more conditions is obtained as follows: a) A subject's score in each condition is multiplied by
the absolute value of the coefficient for that condition (based on the coefficients for the
comparison in Table 24.4); and b) The subject's weighted score is the sum of the products obtained
in part a).

To clarify how the aforementioned procedure is employed to compute the weighted scores
in Table 24.8, the computation of weighted scores for Subjects 1 and 2 will be described. Since
the scores of Subject 1 in Conditions 1 and 2 are 9 and 7, each score is multiplied by the absolute
value of the coefficient for the corresponding condition. Employing the absolute values of the
coefficients noted in Table 24.4 for Conditions 1 and 2, each score is multiplied by 1/2, yielding
(9)(1/2) = 4.5 and (7)(1/2) = 3.5. The weighted score for Subject 1 is obtained by summing the
latter two values. Thus, the weighted score for Subject 1 is 4.5 + 3.5 = 8. In the case of Subject
2, (10)(1/2) = 5 and (8)(1/2) = 4. Thus, the weighted score for Subject 2 is 5 + 4 = 9. The same
procedure is used for the remaining four subjects.
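
The two-step weighting procedure described above can be expressed compactly in Python. The sketch below is illustrative only; the function name `weighted_scores` is hypothetical, and the Condition 1 and Condition 2 scores are those of Example 24.1.

```python
def weighted_scores(conditions, coeffs):
    """Weighted score per subject for a composite of two or more conditions.

    conditions -- list of score lists, one per condition in the composite
    coeffs     -- contrast coefficients for those conditions; each score is
                  multiplied by the absolute value of its coefficient
    """
    return [
        sum(abs(c) * score for c, score in zip(coeffs, subject_scores))
        for subject_scores in zip(*conditions)
    ]


cond1 = [9, 10, 7, 10, 7, 8]
cond2 = [7, 8, 5, 8, 5, 6]

# Coefficients for Conditions 1 and 2 in the complex comparison of Table 24.4
composite = weighted_scores([cond1, cond2], [-0.5, -0.5])

print(composite)
print(sum(composite), sum(x ** 2 for x in composite))
```

The resulting scores, their sum, and their sum of squares correspond to the Conditions 1 & 2 columns of Table 24.8.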

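The sum of squares values that are required for the analysis of variance are computed with
Equations 24.2-24.5. Note that since we are only dealing with two sets of means, k = 2, and
thus, N = nk = (6)(2) = 12.

$$SS_T = \sum X_T^2 - \frac{\left(\sum X_T\right)^2}{N} = 510 - \frac{(74)^2}{12} = 53.67$$

$$SS_{BC} = \sum\left[\frac{\left(\sum X_j\right)^2}{n}\right] - \frac{\left(\sum X_T\right)^2}{N} = \left[\frac{(29)^2}{6} + \frac{(45)^2}{6}\right] - \frac{(74)^2}{12} = 21.34$$

$$SS_{BS} = \sum\left[\frac{\left(\sum S_i\right)^2}{k}\right] - \frac{\left(\sum X_T\right)^2}{N} = \left[\frac{(12)^2}{2} + \frac{(16)^2}{2} + \frac{(9)^2}{2} + \frac{(16)^2}{2} + \frac{(8)^2}{2} + \frac{(13)^2}{2}\right] - \frac{(74)^2}{12} = 28.67$$

$$SS_{res} = SS_T - SS_{BC} - SS_{BS} = 53.67 - 21.34 - 28.67 = 3.66$$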

Upon computing the sum of squares values, the appropriate degrees of freedom (employing
Equations 24.10-24.13) and mean square values (employing Equations 24.7-24.9) are computed.
Employing Equation 24.15, the value $F = F_{comp} = 29.23$ is computed for the comparison. The
analysis of variance is summarized in Table 24.9. Note that in computing the degrees of freedom
for the analysis, the values n = 6 and k = 2 are employed.


Table 24.9 Summary Table for Analysis of Variance
for Comparison of Condition 3 Versus Conditions 1 and 2

Source of variation       SS       df      MS        F
Between-subjects         28.67      5      5.73
Between-conditions       21.34      1     21.34     29.23
Residual                  3.66      5       .73

Total                    53.67     11


Employing $df_{num} = df_{BC} = df_{comp} = 1$ and $df_{den} = df_{res} = 5$, the tabled critical values
employed in Table A10 are $F_{.05} = 6.61$ and $F_{.01} = 16.26$. Since the obtained value F = 29.23
is greater than both of the aforementioned tabled critical values, the nondirectional alternative
hypothesis $H_1: \mu_3 \neq (\mu_1 + \mu_2)/2$ is supported at both the .05 and .01 levels.

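Employing Equations 24.19 and 24.20, a $CD_{LSD}$ and $CD_S$ value can be computed for the
complex comparison.

$$CD_{LSD} = \sqrt{F_{(1,\,res)}}\,\sqrt{\frac{\left(\sum c_j^2\right)(MS_{res})}{n}} = \sqrt{6.61}\,\sqrt{\frac{(1.5)(.73)}{6}} = 1.10$$

$$CD_S = \sqrt{(k-1)F_{(df_{BC},\,df_{res})}}\,\sqrt{\frac{\left(\sum c_j^2\right)(MS_{res})}{n}} = \sqrt{(3-1)(4.10)}\,\sqrt{\frac{(1.5)(.73)}{6}} = 1.22$$

Note that the values $CD_{LSD} = 1.10$ and $CD_S = 1.22$ are larger than the corresponding
values $CD_{LSD} = .78$ and $CD_S = 1.00$, which are computed for the same complex comparison
when $MS_{res} = .49$ is used.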

2. Comparing the means of three or more conditions when k ≥ 4 Within the framework of
a single-factor within-subjects analysis of variance involving k = 4 or more conditions, a
researcher may wish to evaluate a general hypothesis with respect to the means of a subset of
conditions, where the number of conditions in the subset is some value less than k. Although the
latter type of situation is not commonly encountered in research, this section will describe the
protocol for conducting such an analysis. Specifically, the protocol described for the analogous
analysis for a single-factor between-subjects analysis of variance will be extended to the
single-factor within-subjects analysis of variance.

To illustrate, assume that a fourth experimental condition is added to Example 24.1.
Assume that the scores of Subjects 1-6 in Condition 4 are respectively: 3, 8, 2, 6, 4, 6. Thus,
$\sum X_4 = 29$, $\bar{X}_4 = 4.83$, and $\sum X_4^2 = 165$. If the data for Condition 4 are integrated into the
data for the other three conditions (which are summarized in Table 24.1), the following summary
values are computed: N = nk = (6)(4) = 24, $\sum X_T = 148$, $\sum X_T^2 = 1034$. Substituting the revised
values for k = 4 conditions in Equations 24.2-24.5, the following sum of squares values are
computed: $SS_T = 121.33$, $SS_{BC} = 54.66$, $SS_{BS} = 54.33$, $SS_{res} = 12.34$. Employing the values
k = 4 and n = 6 in Equations 24.10-24.12, the values $df_{BC} = 4 - 1 = 3$, $df_{BS} = 6 - 1 = 5$, and
$df_{res} = (6 - 1)(4 - 1) = 15$ are computed. Substituting the appropriate values for the sums of
squares and degrees of freedom in Equations 24.7-24.9, the values $MS_{BC} = 54.66/3 = 18.22$,
$MS_{BS} = 54.33/5 = 10.87$, and $MS_{res} = 12.34/15 = .82$ are computed. Equation 24.15 is
employed to compute the value F = 18.22/.82 = 22.22. Table 24.10 is the summary table of the
analysis of variance.


Table 24.10 Summary Table of Analysis of Variance
for Example 24.1 When k = 4

Source of variation       SS       df      MS        F
Between-subjects         54.33      5     10.87
Between-conditions       54.66      3     18.22     22.22
Residual                 12.34     15       .82

Total                   121.33     23


Employing $df_{num} = 3$ and $df_{den} = 15$, the tabled critical .05 and .01 F values are $F_{.05} = 3.29$
and $F_{.01} = 5.42$. Since the obtained value F = 22.22 is greater than both of the aforementioned
critical values, the null hypothesis (which for k = 4 is $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$) can be rejected
at both the .05 and .01 levels.

Let us assume that prior to the above analysis the researcher has reason to believe that
Conditions 1, 2, and 3 may be distinct from Condition 4. However, before he contrasts the
composite mean of Conditions 1, 2, and 3 with the mean of Condition 4 (i.e., conducts the
complex comparison which evaluates the null hypothesis $H_0: (\mu_1 + \mu_2 + \mu_3)/3 = \mu_4$), he
decides to evaluate the null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$. If the latter null hypothesis is
retained, he will assume that the three conditions share a common mean value, and on the basis
of this he will compare their composite mean with the mean of Condition 4. In order to evaluate
the null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$, it is necessary for the researcher to conduct a separate
analysis of variance that just involves the data for the three conditions identified in the null
hypothesis. The latter analysis of variance has already been conducted, since it is the original
analysis of variance that is employed for Example 24.1, the results of which are summarized
in Table 24.2.

Upon conducting an analysis of variance on the data for all k = 4 conditions as well as an
analysis of variance on the data for the subset comprised of $k_{subset} = 3$ conditions, the researcher
has the necessary information to compute the appropriate F ratio (which will be represented with
the notation $F_{(1/2/3)}$) for evaluating the null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$. If we apply the same
logic employed when the analogous analysis is conducted in reference to the single-factor
between-subjects analysis of variance, the following values are employed to compute the F
ratio to evaluate the latter null hypothesis: a) $MS_{BC} = 20.23$ (which is the value of $MS_{BC}$
computed for the analysis of variance in Table 24.2 that involves only the three conditions identified
in the null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$) is employed as the numerator of the F ratio; and
b) $MS_{res} = .82$ (which is the value of $MS_{res}$ computed in Table 24.10 for the omnibus F test
when the data for all k = 4 conditions are evaluated) is employed as the denominator of the F
ratio. The use of $MS_{res} = .82$ instead of $MS_{res} = .49$ (which is the value of $MS_{res}$ computed
for the analysis of variance in Table 24.2) as the denominator of the F ratio is predicated on the
assumption that $MS_{res} = .82$ provides a more accurate estimate of error variability than
$MS_{res} = .49$. If the latter assumption is made, the value $F_{(1/2/3)} = 24.67$ is computed.

$$F_{(1/2/3)} = \frac{MS_{BC(1/2/3)}}{MS_{res(1/2/3/4)}} = \frac{20.23}{.82} = 24.67$$

The degrees of freedom employed for the analysis are based on the mean square values
employed in computing the $F_{(1/2/3)}$ ratio. Thus: $df_{num} = k_{subset} - 1 = 3 - 1 = 2$ (where
$k_{subset} = 3$ conditions) and $df_{den} = df_{res(1/2/3/4)} = 15$ (which is $df_{res}$ for the omnibus F test
involving all k = 4 conditions). For $df_{num} = 2$ and $df_{den} = 15$, $F_{.05} = 3.68$ and $F_{.01} = 6.36$.
Since the obtained value F = 24.67 is greater than both of the aforementioned critical values, the
null hypothesis can be rejected at both the .05 and .01 levels. Thus, the data do not support the
researcher's hypothesis that Conditions 1, 2, and 3 represent a homogeneous subset. In view of
this, the researcher would not conduct the contrast $(\bar{X}_1 + \bar{X}_2 + \bar{X}_3)/3$ versus $\bar{X}_4$.

If a researcher is not willing to assume that $MS_{res} = .82$ provides a more accurate estimate
of error variability than $MS_{res} = .49$, the latter value can be employed as the denominator term
in computing the $F_{(1/2/3)}$ ratio. Thus, if for some reason a researcher believes that by virtue of
adding a fourth experimental condition experimental error is either increased or decreased, one
can justify employing the value $MS_{res} = .49$ (computed for the k = 3 conditions) as the
denominator term in computing the value $F_{(1/2/3)}$. If $MS_{res} = .49$ is employed to compute the F
ratio, the value $F_{(1/2/3)} = 20.23/.49 = 41.29$ is computed. Since the latter value is greater than
$F_{.05} = 3.68$ and $F_{.01} = 6.36$, the researcher can still reject the null hypothesis at both the .05
and .01 levels. However, the fact that $F_{(1/2/3)} = 41.29$ is substantially larger than $F_{(1/2/3)} = 24.67$
illustrates that depending upon which of the two error terms is employed, it is possible that they
may lead to different conclusions regarding the status of the null hypothesis. The determination
of which error term to use will be based on the assumptions a researcher is willing to make
concerning the data. In the final analysis, in instances where the two error terms yield
inconsistent results, it may be necessary for a researcher to conduct one or more replication
studies in order to clarify the status of a null hypothesis.
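
The effect of the choice of error term on the subset F ratio can be illustrated with a short Python sketch. The helper `mean_squares` and the variable names are hypothetical, and the computational formulas are the standard ones for a single-factor within-subjects design; the scores are those of Example 24.1 together with the Condition 4 scores introduced above.

```python
def mean_squares(conditions):
    """Return (MS_BC, MS_res) for a single-factor within-subjects design.

    conditions -- list of k lists; conditions[j][i] is subject i's score in condition j.
    """
    k, n = len(conditions), len(conditions[0])
    scores = [x for cond in conditions for x in cond]
    correction = sum(scores) ** 2 / (n * k)
    ss_total = sum(x ** 2 for x in scores) - correction
    ss_bc = sum(sum(cond) ** 2 for cond in conditions) / n - correction
    ss_bs = sum(sum(row) ** 2 for row in zip(*conditions)) / k - correction
    ss_res = ss_total - ss_bc - ss_bs
    return ss_bc / (k - 1), ss_res / ((n - 1) * (k - 1))


cond1 = [9, 10, 7, 10, 7, 8]
cond2 = [7, 8, 5, 8, 5, 6]
cond3 = [4, 7, 3, 7, 2, 6]
cond4 = [3, 8, 2, 6, 4, 6]

# Numerator: MS_BC for the subset of Conditions 1, 2, and 3
ms_bc_subset, ms_res_subset = mean_squares([cond1, cond2, cond3])
# Denominator candidate: MS_res from the omnibus analysis of all four conditions
_, ms_res_all = mean_squares([cond1, cond2, cond3, cond4])

f_pooled_error = ms_bc_subset / ms_res_all      # error term from k = 4 analysis
f_subset_error = ms_bc_subset / ms_res_subset   # error term from k = 3 analysis

print(round(f_pooled_error, 2), round(f_subset_error, 2))
```

At full precision the two ratios come out at roughly 24.6 and 41.4, which differ slightly from the values 24.67 and 41.29 reported above because the hand computations employ mean squares rounded to two decimal places.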


3. Evaluation of the sphericity assumption underlying the single-factor within-subjects 
analysis of variance In Section I it is noted that one of the assumptions underlying the single- 
factor within-subjects analysis of variance is the existence of a condition referred to as 
sphericity. Sphericity exists when there is homogeneity of variance among the populations of 
difference scores. The latter can be explained as follows: Assume that for each of the n subjects 
who serve under all k experimental conditions, a difference score is calculated for all pairs of 
conditions. The number of difference scores that can be computed for each subject will equal 
[k(k - 1)]/2. When k = 3, three sets of difference scores can be computed. Specifically: a) A set 
of difference scores that is the result of subtracting each subject's score in Condition 2 from the 
subject's score in Condition 1; b) A set of difference scores that is the result of subtracting each 
subject's score in Condition 3 from the subject's score in Condition 1; and c) A set of difference 
scores that is the result of subtracting each subject's score in Condition 3 from the subject's score
in Condition 2.

The sphericity assumption states that if the estimated population variances for the three sets 
of difference scores are computed, the values of the variances should be equal. The derivation 
of the three sets of difference scores for Example 24.1, and the computation of their estimated 
population variances are summarized in Table 24.11. Note that for each set of difference scores, 
a D value is computed for each subject. The estimated population variance of the D values 
(which is computed with Equation I.5) represents the estimated population variance of a set of 
difference scores. 
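
The derivation of the difference scores and their estimated population variances can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `variance_estimate` is hypothetical, and it is assumed to implement the usual computational form of the unbiased variance estimate (sum of squares minus the squared sum over n, divided by n - 1). The scores are those of Example 24.1.

```python
from itertools import combinations


def variance_estimate(values):
    """Unbiased estimate of the population variance of a set of scores."""
    n = len(values)
    return (sum(v ** 2 for v in values) - sum(values) ** 2 / n) / (n - 1)


conditions = {
    1: [9, 10, 7, 10, 7, 8],
    2: [7, 8, 5, 8, 5, 6],
    3: [4, 7, 3, 7, 2, 6],
}

# One set of difference scores (D values) for each of the k(k - 1)/2 pairs of conditions
diff_variances = {}
for a, b in combinations(conditions, 2):
    d_values = [x - y for x, y in zip(conditions[a], conditions[b])]
    diff_variances[(a, b)] = variance_estimate(d_values)

for pair, variance in diff_variances.items():
    print(pair, round(variance, 2))
```

With k = 3 conditions the loop produces three variances, one for each pair of conditions.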

Visual inspection of the estimated population variances of the difference scores reveals that 
the three variances are quite close to one another. This latter fact suggests that the sphericity 
assumption is unlikely to have been violated. Unfortunately, the tests that are discussed in this 
book for evaluating homogeneity of variance are not appropriate for comparing the variances of 
the difference scores within the framework of evaluating the sphericity assumption. The pro- 
cedures that have been developed for evaluating sphericity require the use of matrix algebra and 
are generally conducted with the aid of a computer. Further reference to such procedures will 
be made later in this discussion. 

Sources on analysis of variance (e.g., Myers and Well (1991, 1995)) note that there is 
another condition known as compound symmetry which is sufficient, although not necessary, 
in order for sphericity to exist. Compound symmetry, which represents a special case of spheric- 
ity, exists when both of the following conditions have been met: a) Homogeneity of variance 
— All of the populations that are represented by the k experimental conditions have equal 
variances; and b) Homogeneity of covariance — All of the population covariances are equal 
to one another. 

At this point we will examine the variances of the three experimental conditions, as well
as the covariances of each pair of conditions. Employing Equation I.5, the estimated population
variance for each of the three experimental conditions is computed.

$$\tilde{s}_1^2 = \frac{\sum X_1^2 - \frac{\left(\sum X_1\right)^2}{n}}{n - 1} = \frac{443 - \frac{(51)^2}{6}}{6 - 1} = 1.9$$

$$\tilde{s}_2^2 = \frac{\sum X_2^2 - \frac{\left(\sum X_2\right)^2}{n}}{n - 1} = \frac{263 - \frac{(39)^2}{6}}{6 - 1} = 1.9$$

$$\tilde{s}_3^2 = \frac{\sum X_3^2 - \frac{\left(\sum X_3\right)^2}{n}}{n - 1} = \frac{163 - \frac{(29)^2}{6}}{6 - 1} = 4.57$$

Table 24.11 Computation of Estimated Population Variances of Difference Scores

Condition 1 versus Condition 2

             $X_1$    $X_2$     D      $D^2$
Subject 1      9        7       2       4
Subject 2     10        8       2       4
Subject 3      7        5       2       4
Subject 4     10        8       2       4
Subject 5      7        5       2       4
Subject 6      8        6       2       4
                          $\sum D = 12$   $\sum D^2 = 24$

$$\tilde{s}_D^2 = \frac{\sum D^2 - \frac{\left(\sum D\right)^2}{n}}{n - 1} = \frac{24 - \frac{(12)^2}{6}}{6 - 1} = 0$$

Condition 1 versus Condition 3

             $X_1$    $X_3$     D      $D^2$
Subject 1      9        4       5      25
Subject 2     10        7       3       9
Subject 3      7        3       4      16
Subject 4     10        7       3       9
Subject 5      7        2       5      25
Subject 6      8        6       2       4
                          $\sum D = 22$   $\sum D^2 = 88$

$$\tilde{s}_D^2 = \frac{\sum D^2 - \frac{\left(\sum D\right)^2}{n}}{n - 1} = \frac{88 - \frac{(22)^2}{6}}{6 - 1} = 1.47$$

Condition 2 versus Condition 3

             $X_2$    $X_3$     D      $D^2$
Subject 1      7        4       3       9
Subject 2      8        7       1       1
Subject 3      5        3       2       4
Subject 4      8        7       1       1
Subject 5      5        2       3       9
Subject 6      6        6       0       0
                          $\sum D = 10$   $\sum D^2 = 24$

$$\tilde{s}_D^2 = \frac{\sum D^2 - \frac{\left(\sum D\right)^2}{n}}{n - 1} = \frac{24 - \frac{(10)^2}{6}}{6 - 1} = 1.47$$

Although Equation 17.9 is not actually employed to evaluate homogeneity of variance
within the framework of the sphericity assumption, for illustrative purposes it will be used. The
latter equation is employed to evaluate the homogeneity of variance assumption for the t test for
two dependent samples by contrasting the highest estimated population variance (which in
Example 24.1 is $\tilde{s}_3^2 = 4.57$) and the lowest estimated population variance (which in Example
24.1 is $\tilde{s}_1^2 = \tilde{s}_2^2 = 1.9$). In order to employ Equation 17.9 it is necessary to compute the
correlation between subjects' scores in Condition 1 (which will be used to represent the lowest
variance) and Condition 3. The value of the correlation coefficient is computed with Equation
17.7. Employing Equation 17.7, the value $r_{X_1 X_3} = .85$ is computed.

Substituting the appropriate values in Equation 17.9, the value t = 1.72 is computed.

$$t = \frac{\left(\tilde{s}_L^2 - \tilde{s}_S^2\right)\sqrt{n - 2}}{\sqrt{4\tilde{s}_L^2\tilde{s}_S^2\left(1 - r_{X_1 X_3}^2\right)}} = \frac{(4.57 - 1.9)\sqrt{6 - 2}}{\sqrt{(4)(4.57)(1.9)\left[1 - (.85)^2\right]}} = 1.72$$

The degrees of freedom associated with the t value computed with Equation 17.9 are
df = n - 2 = 4. Since the computed value t = 1.72 is less than the tabled critical two-tailed value
$t_{.05} = 2.78$ (for df = 4), the null hypothesis $H_0: \sigma_1^2 = \sigma_3^2$ (which states there is homogeneity of
variance) is retained. Thus, there is no evidence to suggest that the homogeneity of variance
assumption is violated.
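
The test of the largest against the smallest variance can be sketched in Python as follows. The sketch rests on an assumption that should be checked against Chapter 17: Equation 17.9 is taken to have the usual form of the t test for two dependent variances, with the difference between the larger and smaller variance estimates multiplied by the square root of n - 2 in the numerator, and the square root of four times the product of the two variances times one minus the squared correlation in the denominator. The helper names are hypothetical.

```python
import math


def variance_estimate(x):
    """Unbiased estimate of the population variance."""
    n = len(x)
    return (sum(v ** 2 for v in x) - sum(x) ** 2 / n) / (n - 1)


def covariance_estimate(x, y):
    """Unbiased estimate of the population covariance of paired scores."""
    n = len(x)
    return (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) / (n - 1)


cond1 = [9, 10, 7, 10, 7, 8]   # represents the smaller variance
cond3 = [4, 7, 3, 7, 2, 6]     # has the larger variance
n = len(cond1)

var_small = variance_estimate(cond1)
var_large = variance_estimate(cond3)

# Pearson correlation between the two sets of scores
r = covariance_estimate(cond1, cond3) / math.sqrt(var_small * var_large)

# Assumed form of the t test for two dependent variances, df = n - 2
t = (var_large - var_small) * math.sqrt(n - 2) / math.sqrt(
    4 * var_large * var_small * (1 - r ** 2)
)

print(round(var_small, 2), round(var_large, 2), round(r, 2), round(t, 2))
```

Computed without intermediate rounding, the t value is approximately 1.71, which differs in the second decimal place from the hand-computed value of 1.72 and leads to the same decision to retain the null hypothesis.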

Earlier in the discussion it was noted that the sphericity assumption assumes equal popu- 
lation covariances. Whereas variance is a measure of variability of the scores of n subjects on 
a single variable, covariance (which is discussed in more detail in Section VII of the Pearson 
product-moment correlation coefficient (Test 28)) is a measure that represents the degree to 
which two variables vary together. A positive covariance is associated with variables that are 
positively correlated with one another, and a negative covariance is associated with variables that 
are negatively correlated with one another. 

Equation 24.21 is the general equation for computing covariance. The value computed
with Equation 24.21 represents the estimated covariance between Population a and Population
b.

$$cov_{X_a X_b} = \frac{\sum X_a X_b - \frac{\left(\sum X_a\right)\left(\sum X_b\right)}{n}}{n - 1} \qquad \text{(Equation 24.21)}$$


Since a covariance can be computed for any pair of experimental conditions, in Example
24.1 three covariances can be computed: specifically, the covariance between Conditions 1
and 2, the covariance between Conditions 1 and 3, and the covariance between Conditions 2
and 3. To illustrate the computation of the covariance, the covariance between Conditions 1
and 2 ($cov_{X_1 X_2}$) will be computed employing Equation 24.21. Table 24.12, which reproduces
the data for Conditions 1 and 2, summarizes the values employed in the calculation of the
covariance.

Table 24.12 Data Required for Computing Covariance
of Condition 1 and Condition 2

             $X_1$     $X_2$     $X_1 X_2$
Subject 1      9         7         63
Subject 2     10         8         80
Subject 3      7         5         35
Subject 4     10         8         80
Subject 5      7         5         35
Subject 6      8         6         48

$\sum X_1 = 51$   $\sum X_2 = 39$   $\sum X_1 X_2 = 341$


Employing Equation 24.21 the value $\text{cov}_{X_1 X_2} = 1.9$ is computed.

$$ \text{cov}_{X_1 X_2} = \frac{341 - \frac{(51)(39)}{6}}{6 - 1} = 1.9 $$
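The computation with Equation 24.21 can be sketched in a few lines of Python. This is a minimal illustration only: the function name `covariance` is hypothetical (it does not come from the text), and the scores are those listed for Conditions 1 and 2 in Table 24.12.

```python
def covariance(xs, ys):
    """Estimated population covariance, computational form of Equation 24.21.

    cov = [sum(X_a * X_b) - (sum(X_a) * sum(X_b)) / n] / (n - 1)
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy - sum(xs) * sum(ys) / n) / (n - 1)


# Scores for Conditions 1 and 2 as listed in Table 24.12
condition_1 = [9, 10, 7, 10, 7, 8]
condition_2 = [7, 8, 5, 8, 5, 6]

cov_12 = covariance(condition_1, condition_2)
print(round(cov_12, 2))  # 1.9
```

The same function can be applied to any other pair of conditions once their scores are available.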


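If the relevant data for the other two sets of scores are employed, the values $\text{cov}_{X_1 X_3} = 2.5$
and $\text{cov}_{X_2 X_3} = 2.5$ are computed.

$$ \text{cov}_{X_1 X_3} = \frac{259 - \frac{(51)(29)}{6}}{6 - 1} = 2.5 $$

$$ \text{cov}_{X_2 X_3} = \frac{201 - \frac{(39)(29)}{6}}{6 - 1} = 2.5 $$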


Since the three values for covariance are extremely close to one another, on the basis of 
visual inspection it would appear that the data are characterized by homogeneity of covariance. 
Coupled with the fact that homogeneity of variance also appears to exist, it would seem reason- 
able to conclude that the assumptions underlying compound symmetry (and thus of sphericity) 
are unlikely to have been violated. 

The conditions necessary for compound symmetry (which, as previously noted, is not 
required in order for sphericity to exist), are, in fact, more stringent than the general requirement 
of sphericity (i.e., that there be homogeneity of variance among the populations of difference 
scores). Whenever data are characterized by compound symmetry, homogeneity of variance will 
exist among the populations of difference scores. However, it is possible to have homogeneity 
of variance among the populations of difference scores, yet not have compound symmetry. 

A full discussion of the tests that are employed to evaluate the sphericity assumption under- 
lying the single-factor within-subjects analysis of variance is beyond the scope of this book. 
The interested reader can find a description of such tests in selected texts that specialize in 
analysis of variance (e.g., Kirk (1982, 1995)). Keppel (1991), among others, notes, however, that
tests which have been developed to evaluate the sphericity assumption have their own assump- 
tions, and when the assumptions of the latter tests are violated (which may be more often than 
not) their reliability will be compromised. In view of this, Keppel (1991) questions the wisdom 
of employing such tests for evaluating the sphericity assumption. 

At this point some general comments are in order regarding the consequences of violating 
the sphericity assumption. In the discussion of the t test for two dependent samples it is noted
that the latter test is much more sensitive to violation of the homogeneity of variance assumption
than is the t test for two independent samples (Test 11). Since this observation can be
generalized to designs involving more than two treatments, the single-factor within-subjects 
analysis of variance is more sensitive to violation of the sphericity assumption than is the single- 
factor between-subjects analysis of variance to violation of its assumption of homogeneity of 
variance. In point of fact, most sources suggest that the single-factor within-subjects analysis 
of variance is extremely sensitive to violations of the sphericity assumption, and that when the 
latter assumption is violated, the tabled critical values in Table A10 will not be accurate. 
Specifically, when the sphericity assumption is violated, the tabled critical F value associated 
with the appropriate degrees of freedom for the analysis of variance will be too low (i.e., the 
Type I error rate for the analysis will actually be higher than the prespecified value). One option 
proposed by Geisser and Greenhouse (1958) is to employ the tabled critical F value
associated with $df_{num} = 1$ and $df_{den} = n - 1$ instead of the tabled critical F value associated
with the usual degrees of freedom values (i.e., $df_{num} = df_{BC} = k - 1$ and $df_{den} = df_{res}$
$= (n - 1)(k - 1)$). However, since the Geisser-Greenhouse method tends to overcorrect the value
of F (i.e., it results in too high a critical value), some sources recommend an alternative but com- 
putationally more involved method developed by Box (1954) which does not result in as severe 
an adjustment of the critical F value as the Geisser-Greenhouse method. 
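
The contrast between the usual and the Geisser-Greenhouse degrees of freedom can be made concrete with a minimal Python sketch. The function names are hypothetical and are not taken from the text. The sketch only computes the two pairs of degrees of freedom; the corresponding critical F values would still have to be looked up in Table A10.

```python
def usual_df(n, k):
    """Degrees of freedom normally used for the omnibus F test: (k - 1, (n - 1)(k - 1))."""
    return k - 1, (n - 1) * (k - 1)


def geisser_greenhouse_df(n):
    """Conservative degrees of freedom proposed by Geisser and Greenhouse: (1, n - 1)."""
    return 1, n - 1


# Example 24.1 employs n = 6 subjects and k = 3 conditions.
df_usual = usual_df(n=6, k=3)
df_conservative = geisser_greenhouse_df(n=6)
print(df_usual, df_conservative)  # (2, 10) (1, 5)
```

Because the conservative pair has fewer degrees of freedom in both the numerator and the denominator, the tabled critical F value associated with it is larger, which is the sense in which the adjustment guards against an inflated Type I error rate.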

Keppel (1991) notes that it is quite common for the sphericity assumption to be violated in 
experiments that utilize a within-subjects design. In view of this he recommends that in em- 
ploying the single-factor within-subjects analysis of variance to evaluate the latter design, it 
is probably always prudent to run a more conservative test in order to insure that the Type I error 
rate is adequately controlled. An even more extreme viewpoint is articulated by other sources 
who recommend that when there is reason to believe that the sphericity assumption is violated, 
one should evaluate the data with a procedure other than the single-factor within-subjects 
analysis of variance. Specifically, these sources (e.g., Maxwell and Delaney (1990) and Howell 
(1992, 1997)) recommend evaluating the data for a within-subjects design with a multivariate 
analysis of variance (MANOVA) (described in Section VII of the single-factor between- 
subjects analysis of variance). 

It should be apparent from the discussion in this section that there is lack of agreement with 
respect to the most appropriate methodology for dealing with violation of the sphericity assump- 
tion. A cynic might conclude that regardless of which method one employs, there will always 
be reason to doubt the accuracy of the probability value associated with the outcome of a study. 
As noted throughout this book, in situations where there are doubts concerning the reliability of 
an analysis, the most powerful tool the researcher has at her disposal is replication. In the final 
analysis, the truth regarding a hypothesis will ultimately emerge if one or more researchers con- 
duct multiple studies that evaluate the same hypothesis. In instances where replication studies 
have been conducted, meta-analysis (discussed in Section IX (the Addendum) of the Pearson 
product-moment correlation coefficient) can be employed to derive a pooled probability value 
for all of the published studies.


4. Computation of the power of the single-factor within-subjects analysis of variance Prior 
to reading this section the reader should review the discussion of power in Section VI of 
the single-factor between-subjects analysis of variance, since basically the same procedure 
is employed to determine the power of the single-factor within-subjects analysis of variance. 
The power of the single-factor within-subjects analysis of variance is computed with Equa- 
tion 24.22, which is identical to Equation 21.38 (which is the equation used for computing the 
power of the single-factor between-subjects analysis of variance), except for the fact that the
estimated value of $\sigma_{res}^2$ is employed as the measure of error variability in place of $\sigma_{WG}^2$ (McFatter
and Gollob (1986)).


$$ \phi = \sqrt{\frac{n \sum_{j=1}^{k} (\mu_j - \mu_T)^2}{k \sigma_{res}^2}} \qquad \text{(Equation 24.22)} $$

Where:  $\mu_j$ = The estimated mean of the population represented by Condition j
        $\mu_T$ = The grand mean, which is the average of the k estimated population means
        $\sigma_{res}^2$ = The estimated measure of error variability
        n = The number of subjects
        k = The number of experimental conditions


To illustrate the use of Equation 24.22 with Example 24.1, let us assume that prior to
conducting the study the researcher estimates that the means of the populations represented by
the three conditions are as follows: $\mu_1 = 8$, $\mu_2 = 6$, $\mu_3 = 4$. Additionally, it will be
assumed that he estimates the population error variance associated with the analysis will equal
$\sigma_{res}^2 = 1$. Based on this information, the value $\mu_T$ can be computed:
$\mu_T = (\mu_1 + \mu_2 + \mu_3)/k = (8 + 6 + 4)/3 = 6$. The appropriate values are now substituted in
Equation 24.22.

$$ \phi = \sqrt{\frac{n[(8 - 6)^2 + (6 - 6)^2 + (4 - 6)^2]}{(3)(1)}} = \sqrt{2.67n} = 1.63\sqrt{n} $$


At this point Table A15 (Graphs of the Power Function for the Analysis of Variance)
in the Appendix can be employed to determine the necessary sample size required in order to
have the power stipulated by the experimenter. For our analysis (for which it will be assumed
$\alpha = .05$) the appropriate set of curves to employ is the set for $df_{num} = df_{BC} = 2$. Let us assume
we want the omnibus F test to have a power of at least .80. We now substitute what we consider
to be a reasonable value for n in the equation $\phi = 1.63\sqrt{n}$ (which is the result obtained with
Equation 24.22). To illustrate, the value n = 6 (the sample size employed in Example 24.1) is
substituted in the equation. The resulting value is $\phi = 1.63\sqrt{6} = 3.99$.

The value $\phi = 3.99$ is located on the abscissa (X-axis) of the relevant set of curves in Table
A15, specifically the set for $df_{num} = 2$. At the point corresponding to $\phi = 3.99$, a
perpendicular line is erected from the abscissa which intersects with the power curve that
corresponds to $df_{den} = df_{res}$ employed for the omnibus F test. Since $df_{res} = 10$, the curve for
the latter value is employed (or closest to it if a curve for the exact value is not available). At the
point the perpendicular intersects the curve $df_{den} = 10$, a second perpendicular line is drawn in
relation to the ordinate (Y-axis). The point at which this perpendicular intersects the ordinate
indicates the power of the test. Since $\phi = 3.99$, we determine the power equals 1. Thus, if we
employ six subjects in a within-subjects design, there is a 100% likelihood (which corresponds
to a probability of 1) of detecting an effect size equal to or larger than the one stipulated by the
researcher (which is a function of the estimated values for the population means relative to the
value estimated for error variability). Since the probability of committing a Type II error is
$\beta = 1 - \text{power}$, $\beta = 1 - 1 = 0$. This value represents the likelihood of not detecting an effect size
equal to or greater than the one stipulated.
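
The value of $\phi$ for the omnibus test can be checked with a short Python sketch. It assumes the form of Equation 24.22 implied by the worked values, namely $\phi = \sqrt{n \Sigma(\mu_j - \mu_T)^2 / (k \sigma_{res}^2)}$; the function name `phi_omnibus` is hypothetical. The sketch returns only $\phi$. Reading the power itself still requires the curves in Table A15.

```python
import math


def phi_omnibus(means, error_variance, n):
    """Noncentrality index phi for the omnibus F test (form assumed for Equation 24.22)."""
    k = len(means)
    grand_mean = sum(means) / k
    sum_sq_dev = sum((m - grand_mean) ** 2 for m in means)
    return math.sqrt(n * sum_sq_dev / (k * error_variance))


# Estimated population means 8, 6, 4 and estimated error variance 1, as in the text
phi_per_root_n = phi_omnibus([8, 6, 4], 1.0, n=1)  # coefficient of sqrt(n)
phi = phi_omnibus([8, 6, 4], 1.0, n=6)
print(round(phi_per_root_n, 2), round(phi, 2))  # 1.63 4.0
```

Without intermediate rounding the result for n = 6 is exactly 4; the value 3.99 reported in the text arises because the coefficient 1.63 is rounded before it is multiplied by $\sqrt{6}$.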

Equation 24.23 (described in McFatter and Gollob (1986)) can be employed to conduct a
power analysis for a comparison associated with a single-factor within-subjects analysis of
variance. Equation 24.23 is identical to Equation 21.39 (which is the equation for evaluating the
power of a comparison for the single-factor between-subjects analysis of variance), except for
the fact that $\sigma_{res}^2$ is employed as the measure of error variability in place of $\sigma_{WG}^2$.





$$ \phi_{comp} = \sqrt{\frac{n(\mu_a - \mu_b)^2}{2(\sigma_{res}^2)(\Sigma c_j^2)}} \qquad \text{(Equation 24.23)} $$


As is the case for a single-factor between-subjects analysis of variance, Equation 24.23 
can be used for both simple and complex single degree of freedom comparisons. As a general 
rule, the equation is used for planned comparisons. As noted in the discussion of the single- 
factor between-subjects analysis of variance, although the equation can be extended to un- 
planned comparisons, published power tables for the analysis of variance generally only apply 
to per comparison error rates of $\alpha = .05$ and $\alpha = .01$. In the case of planned and especially
unplanned comparisons which involve $\alpha_{PC}$ rates other than .05 or .01, more detailed tables are
required.

For single degree of freedom comparisons, the power curves in Table A15 for $df_{num} = 1$
are always employed. The use of Equation 24.23 will be illustrated for the simple comparison
Condition 1 versus Condition 2 (summarized in Table 24.3). Since $\Sigma c_j^2 = 2$, and we have
estimated $\mu_a = \mu_1 = 8$, $\mu_b = \mu_2 = 6$, and $\sigma_{res}^2 = 1$, the following result is obtained:

$$ \phi_{comp} = \sqrt{\frac{n(8 - 6)^2}{2(1)(2)}} = \sqrt{n} $$


Substituting n = 6 in the equation $\phi_{comp} = \sqrt{n}$, we obtain $\phi_{comp} = \sqrt{6} = 2.45$. Employing
the power curves for $df_{num} = 1$ with $\alpha = .05$, we use the curve for $df_{den} = 10$ ($df_{res}$ employed
for the omnibus F test) and determine that, when $\phi_{comp} = 2.45$, the power of the test is
approximately .88.
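
The comparison value can be reproduced with a minimal Python sketch. It assumes that Equation 24.23 has the form $\phi_{comp} = \sqrt{n(\mu_a - \mu_b)^2 / (2\sigma_{res}^2 \Sigma c_j^2)}$, which is consistent with the result $\phi_{comp} = \sqrt{n}$ given above. The function name and the coefficient vector (1, -1, 0) are illustrative assumptions; the text states only that the squared coefficients sum to 2.

```python
import math


def phi_comparison(mean_a, mean_b, error_variance, coefficients, n):
    """Phi for a single degree of freedom comparison (form assumed for Equation 24.23)."""
    sum_c_squared = sum(c * c for c in coefficients)
    return math.sqrt(n * (mean_a - mean_b) ** 2 / (2 * error_variance * sum_c_squared))


# Condition 1 versus Condition 2: assumed coefficients (1, -1, 0), so the sum of c^2 is 2
phi_comp = phi_comparison(8, 6, 1.0, [1, -1, 0], n=6)
print(round(phi_comp, 2))  # 2.45
```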


5. Measures of magnitude of treatment effect for the single-factor within-subjects analysis
of variance: Omega squared (Test 24g) and Cohen's f index (Test 24h) Prior to reading this
section the reader should review the discussion of measures of magnitude of treatment effect in
Section VI of both the t test for two independent samples and the single-factor between-
subjects analysis of variance. The discussion for the latter test notes that the computation of
an omnibus F value only provides a researcher with information regarding whether the null 
hypothesis can be rejected — i.e., whether a significant difference exists between at least two of 
the experimental conditions. The F value (as well as the level of significance with which it is 
associated), however, does not provide the researcher with any information regarding the size 
of any treatment effect that is present. As is noted in earlier discussions of treatment effect, the 
latter is defined as the proportion of the variability on the dependent variable that is associated 
with the independent variable/experimental conditions. The measures described in this section 
are variously referred to as measures of effect size, measures of magnitude of treatment 
effect, measures of association, and correlation coefficients. 


Omega squared (Test 24g) The omega squared statistic is a commonly computed measure 
of treatment effect for the single-factor within-subjects analysis of variance. Keppel (1991) 
and Kirk (1995) note that there is disagreement with respect to which variance components 
should be employed in computing omega squared for a within-subjects design. One method of 
computing omega squared (which computes a value referred to as standard omega squared) was
employed in the previous edition of this book. The latter method expresses treatment (i.e.,
between-conditions) variability as a proportion of the sum of all the elements that account for
variability in a within-subjects design. Equation 24.25, which is presented in Myers and Well
(1995), can be employed to compute standard omega squared ($\hat{\omega}^2$). The omega squared
value computed with Equation 24.25 is an estimate of the proportion of variability in the data that
is attributed to the experimental treatments ($\sigma_{BC}^2$) divided by the sum total of variability in the
data (i.e., treatment variability ($\sigma_{BC}^2$) plus between-subjects variability ($\sigma_{BS}^2$) plus residual
variability ($\sigma_{res}^2$)). Thus, Equation 24.24 represents the population parameter ($\omega^2$) estimated by
Equation 24.25. Employing Equation 24.25 with the data for Example 24.1, the value $\hat{\omega}^2 = .44$
is computed.


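$$ \omega^2 = \frac{\sigma_{BC}^2}{\sigma_{BC}^2 + \sigma_{BS}^2 + \sigma_{res}^2} \qquad \text{(Equation 24.24)} $$

$$ \hat{\omega}^2 = \frac{(k - 1)(MS_{BC} - MS_{res})}{(k - 1)(n - 1)MS_{res} + (k - 1)MS_{BC} + nMS_{BS}} \qquad \text{(Equation 24.25)} $$

$$ \hat{\omega}^2 = \frac{(3 - 1)(20.23 - .49)}{(3 - 1)(6 - 1)(.49) + (3 - 1)(20.23) + (6)(7.39)} = .44 $$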





The value $\hat{\omega}^2 = .44$ indicates that 44% (or a proportion of .44) of the variability on the
dependent variable (the number of nonsense syllables correctly recalled) is associated with 
variability on the levels of the independent variable (noise). 

A second method for computing omega squared computes what is referred to as partial 
omega squared. The latter measure, which Keppel (1991) and Kirk (1995) view as more mean- 
ingful than standard omega squared, ignores between-subjects variability, and expresses 
treatment (i.e., between-conditions) variability as a proportion of the sum of between-conditions
and residual variability. Equation 24.27 is employed to compute partial omega squared ($\hat{\omega}_p^2$).
Equation 24.26 represents the population parameter ($\omega_p^2$) estimated by Equation 24.27.
Employing Equation 24.27, the value $\hat{\omega}_p^2 = .82$ is computed.


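$$ \omega_p^2 = \frac{\sigma_{BC}^2}{\sigma_{BC}^2 + \sigma_{res}^2} \qquad \text{(Equation 24.26)} $$

$$ \hat{\omega}_p^2 = \frac{\hat{\sigma}_{BC}^2}{\hat{\sigma}_{BC}^2 + \hat{\sigma}_{res}^2} \qquad \text{(Equation 24.27)} $$

Where:

$$ \hat{\sigma}_{BC}^2 = \frac{(k - 1)(MS_{BC} - MS_{res})}{nk} = \frac{(2)(20.23 - .49)}{(6)(3)} = 2.19 \qquad \hat{\sigma}_{res}^2 = MS_{res} = .49 $$

Thus:

$$ \hat{\omega}_p^2 = \frac{2.19}{2.19 + .49} = .82 $$

Equation 24.28 can also be employed to compute the value of partial omega squared.

$$ \hat{\omega}_p^2 = \frac{(k - 1)(F - 1)}{(k - 1)(F - 1) + nk} \qquad \text{(Equation 24.28)} $$

$$ \hat{\omega}_p^2 = \frac{(3 - 1)(41.29 - 1)}{(3 - 1)(41.29 - 1) + (6)(3)} = .82 $$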


The value $\hat{\omega}_p^2 = .82$ computed for partial omega squared indicates that 82% (or a
proportion of .82) of the variability on the dependent variable (the number of nonsense syllables 
correctly recalled) is associated with variability on the levels of the independent variable (noise). 
Note that because it does not take into account between-subjects variability, partial omega 
squared yields a much higher value than standard omega squared. 
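
The two versions of omega squared can be compared side by side with a minimal Python sketch. The function names are hypothetical. The formulas follow Equations 24.25, 24.27, and 24.28 as reconstructed from the worked values in this section (with $MS_{res}$ taken as the estimate of residual variance), and the mean squares and F ratio are those reported for Example 24.1.

```python
def standard_omega_squared(ms_bc, ms_bs, ms_res, n, k):
    """Standard omega squared: treatment variance over all variance components."""
    numerator = (k - 1) * (ms_bc - ms_res)
    denominator = (k - 1) * (n - 1) * ms_res + (k - 1) * ms_bc + n * ms_bs
    return numerator / denominator


def partial_omega_squared(ms_bc, ms_res, n, k):
    """Partial omega squared: between-subjects variability is ignored."""
    var_bc = (k - 1) * (ms_bc - ms_res) / (n * k)  # estimated treatment variance
    return var_bc / (var_bc + ms_res)              # MS_res estimates residual variance


def partial_omega_squared_from_f(f_ratio, n, k):
    """Equivalent computation of partial omega squared from the F ratio."""
    effect = (k - 1) * (f_ratio - 1)
    return effect / (effect + n * k)


# Values reported for Example 24.1
ms_bc, ms_bs, ms_res = 20.23, 7.39, 0.49
n, k = 6, 3

omega_sq = standard_omega_squared(ms_bc, ms_bs, ms_res, n, k)
partial_omega_sq = partial_omega_squared(ms_bc, ms_res, n, k)
partial_from_f = partial_omega_squared_from_f(41.29, n, k)
print(round(omega_sq, 2), round(partial_omega_sq, 2), round(partial_from_f, 2))  # 0.44 0.82 0.82
```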

It was noted in an earlier discussion of omega squared (in Section VI of the t test for two
independent samples) that Cohen (1977; 1988, pp. 284–287) has suggested the following
(admittedly arbitrary) values, which are employed in psychology and a number of other disciplines,
as guidelines for interpreting $\hat{\omega}^2$: a) A small effect size is one that is greater than .0099 but not
more than .0588; b) A medium effect size is one that is greater than .0588 but not more than
.1379; and c) A large effect size is greater than .1379. If one employs Cohen's (1977, 1988)
guidelines for magnitude of treatment effect, both $\hat{\omega}^2 = .44$ and $\hat{\omega}_p^2 = .82$ represent a large
treatment effect.


Cohen's f index (Test 24h) If the value of partial omega squared is substituted in Equation
21.45, Cohen's f index can be computed. In Section VI of the single-factor between-subjects
analysis of variance, it was noted that Cohen's f index is an alternate measure of effect size that
can be employed for an analysis of variance. The computation of Cohen's f index with Equation
21.45 yields the value f = 2.13.

$$ f = \sqrt{\frac{\hat{\omega}_p^2}{1 - \hat{\omega}_p^2}} = \sqrt{\frac{.82}{1 - .82}} = 2.13 $$





In the discussion of Cohen's f index in Section VI of the single-factor between-subjects
analysis of variance, it was noted that Cohen (1977; 1988, pp. 284–288) employed the following
(admittedly arbitrary) f values as criteria for identifying the magnitude of an effect size: a) A
small effect size is one that is greater than .1 but not more than .25; b) A medium effect size is
one that is greater than .25 but not more than .4; and c) A large effect size is greater than .4.
Employing Cohen's criteria, the value f = 2.13 represents a large effect size.
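
The conversion from partial omega squared to Cohen's f index, together with Cohen's cutoffs, can be sketched as follows. The function names and the label returned for values of .1 or less are illustrative assumptions; the rounded value .82 is used as input so that the result matches the hand computation.

```python
import math


def cohens_f(omega_squared):
    """Cohen's f index: f = sqrt(omega^2 / (1 - omega^2))."""
    if not 0 <= omega_squared < 1:
        raise ValueError("omega squared must lie in the interval [0, 1)")
    return math.sqrt(omega_squared / (1 - omega_squared))


def classify_f(f):
    """Label an f value with Cohen's (admittedly arbitrary) criteria."""
    if f > 0.4:
        return "large"
    if f > 0.25:
        return "medium"
    if f > 0.1:
        return "small"
    return "below small"  # label chosen for this sketch; the text names no category here


f_index = cohens_f(0.82)
print(round(f_index, 2), classify_f(f_index))  # 2.13 large
```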

A thorough discussion of the general issues involved in computing a measure of magnitude 
of treatment effect for a single-factor within-subjects analysis of variance can be found in 
Keppel (1991) and Kirk (1995). Further discussion of the indices of treatment effect discussed 
in this section, and the relationship between effect size and statistical power can be found in 
Section IX (the Addendum) of the Pearson product-moment correlation coefficient under the 
discussion of meta-analysis and related topics. 


6. Computation of a confidence interval for the mean of a treatment population Prior to
reading this section the reader should review the discussion of confidence intervals in Section
VI of both the single sample t test (Test 2) and the single-factor between-subjects analysis of
variance. The same procedure employed to compute a confidence interval for a treatment
population for the single-factor between-subjects analysis of variance is employed for computing
the confidence interval for the mean of a treatment population for a single-factor within-
subjects analysis of variance. In other words, in order to compute a confidence interval for any
experimental treatment/condition, one must conceptualize a within-subjects design as if it were
a between-subjects design. The reason for this is that a confidence interval for any single
condition will be a function of the variability of the scores of subjects who serve within that
condition. Since $MS_{res}$, the measure of error variability for the repeated-measures analysis of
variance, is a measure of within-subjects variability that is independent of any treatment effect,
it cannot be employed to estimate the error variability for a specific treatment if one wants to
compute a confidence interval for the mean of a treatment population. As is the case with the
single-factor between-subjects analysis of variance, one can employ either of the following
two strategies in computing the confidence interval for a treatment.

a) If one assumes that all of the treatments represent a population with the same variance,
Equation 21.48 can be employed to compute the confidence interval (in our example we will
assume the 95% confidence interval is being computed). In order to employ Equation 21.48 it
is necessary to compute the value of $MS_{WG}$, which in the case of the single-factor within-
subjects analysis of variance can be conceptualized as a within-conditions mean square
($MS_{WC}$). In order to compute $MS_{WC}$, it is necessary to first compute the within-conditions sum
of squares ($SS_{WC}$). The latter value is computed employing Equation 24.29 (which is identical
to Equation 21.5, except it employs the subscript WC in place of WG).


$$ SS_{WC} = \sum_{j=1}^{k} \left[ \Sigma X_j^2 - \frac{(\Sigma X_j)^2}{n} \right] \qquad \text{(Equation 24.29)} $$

For Example 24.1, Equation 24.29 yields the value $SS_{WC} = 41.83$.

The within-conditions degrees of freedom is computed in an identical manner as is the
within-groups degrees of freedom for the single-factor between-subjects analysis of variance.
Thus, using Equation 21.9 (using the subscript WC in place of WG), $df_{WC} = N - k = 18 - 3 = 15$.
The within-conditions mean square can now be computed: $MS_{WC} = SS_{WC}/df_{WC} = 41.83/15
= 2.79$. Employing Equation 21.48, the 95% confidence interval for the mean of the population
represented by Condition 1 is computed. The value $t_{.05} = 2.13$ is the tabled critical two-tailed
$t_{.05}$ value for $df_{WC} = 15$.


$$ CI_{.95} = \bar{X}_1 \pm t_{.05} \sqrt{\frac{MS_{WC}}{n}} = 8.5 \pm (2.13) \sqrt{\frac{2.79}{6}} = 8.5 \pm 1.45 $$


Thus, the researcher can be 95% confident (or the probability is .95) that the mean of the 
population represented by Condition 1 falls within the range 7.05 to 9.95. Stated symbolically: 
$7.05 < \mu_1 < 9.95$.
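
The pooled-variance interval can be verified with a short Python sketch. The function name `ci_pooled` is hypothetical, and the critical t value is passed in as the tabled value quoted in the text rather than computed.

```python
import math


def ci_pooled(mean, ms_wc, n, t_crit):
    """Confidence interval for a condition mean based on the pooled within-conditions mean square."""
    half_width = t_crit * math.sqrt(ms_wc / n)
    return mean - half_width, mean + half_width


# Condition 1: mean 8.5, MS_WC = 2.79, n = 6, tabled two-tailed t(.05) = 2.13 for df = 15
lower_pooled, upper_pooled = ci_pooled(8.5, 2.79, 6, 2.13)
print(round(lower_pooled, 2), round(upper_pooled, 2))  # 7.05 9.95
```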

b) If one has reason to believe that the treatment in question is distinct from the other 
treatments, Equation 2.7 can be employed to compute the confidence interval. Specifically, if 
the mean value of a treatment is substantially above or below the means of the other treatments, 
it can be argued that one can no longer assume that the treatment shares a common variance with 
the other treatments. In view of this, one can take the position that the variance of the treatment 
is the best estimate of the variance of the population represented by that treatment (as opposed 
to the pooled variability of all of the treatments involved in the study). This position can also be 
taken even if the means of the k treatments are equal, but the treatments have substantially 
different estimated population variances.

If Equation 2.7 is employed to compute the 95% confidence interval for the population
mean of Condition 1, the estimated variance of the population Condition 1 represents is
employed in lieu of the pooled within-conditions variability. In addition, the tabled critical two-
tailed $t_{.05}$ value for $df = n - 1 = 5$ (which is $t_{.05} = 2.57$) is employed in the equation. The
computation of the confidence interval is illustrated below. Initially, the estimated population
standard deviation is computed, which is then substituted in Equation 2.7.





$$ \tilde{s}_1 = \sqrt{\frac{\Sigma X_1^2 - \frac{(\Sigma X_1)^2}{n}}{n - 1}} = \sqrt{\frac{443 - \frac{(51)^2}{6}}{6 - 1}} = 1.38 $$

$$ CI_{.95} = \bar{X}_1 \pm t_{.05} \frac{\tilde{s}_1}{\sqrt{n}} = 8.5 \pm (2.57) \frac{1.38}{\sqrt{6}} = 8.5 \pm 1.45 $$

Thus, the result obtained with Equation 2.7 indicates that the range for $CI_{.95}$ is $7.05 < \mu_1
< 9.95$. Note that in the case of Condition 1, the confidence intervals computed with Equations
21.48 and 2.7 are identical. This will not always be true, especially when the within-condition
variability of the treatment for which the confidence interval is computed is substantially
different than the within-condition variability of the other treatments.
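
The second strategy, in which only the variability of the condition itself is employed, can be sketched with Python's standard library. The Condition 1 scores are those listed in Table 24.12, `statistics.stdev` returns the estimated population standard deviation (n - 1 in the denominator), and the tabled critical value 2.57 is entered by hand.

```python
import math
import statistics

# Condition 1 scores from Table 24.12
condition_1 = [9, 10, 7, 10, 7, 8]

n_1 = len(condition_1)
mean_1 = statistics.mean(condition_1)   # 8.5
s_tilde_1 = statistics.stdev(condition_1)  # estimated population standard deviation
t_crit = 2.57                           # tabled two-tailed t(.05) for df = n - 1 = 5

half_width = t_crit * s_tilde_1 / math.sqrt(n_1)
lower_own, upper_own = mean_1 - half_width, mean_1 + half_width
print(round(s_tilde_1, 2), round(lower_own, 2), round(upper_own, 2))  # 1.38 7.05 9.95
```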


VII. Additional Discussion of the Single-Factor Within-Subjects 
Analysis of Variance 


1. Theoretical rationale underlying the single-factor within-subjects analysis of variance 
In the single-factor within-subjects analysis of variance the total variability can be partitioned 
into the following two elements: a) Between-subjects variability (which is represented by
$MS_{BS}$) represents the variability between the mean scores of the n subjects. In other words, a
mean value for each subject who has served in each of the k experimental conditions is
computed, and $MS_{BS}$ represents the variance of the n subject means; and b) Within-subjects
variability (which will be represented by the notation $MS_{WS}$) represents variability within the
k scores of each of the n subjects. In other words, for each subject the variance for that subject's
k scores is computed, and the average of the n variances represents within-subjects variability.
Within-subjects variability can itself be partitioned into two elements: between-conditions
variability (which is represented by $MS_{BC}$) and residual variability (which is represented by
$MS_{res}$). Between-conditions variability is essentially a measure of variance of the means of the
k experimental conditions. In the single-factor within-subjects analysis of variance, it is
assumed that any variability between the means of the conditions can be attributed to one or both
of the following two elements: 1) The experimental treatments; and 2) Experimental error.
When $MS_{BC}$ (the value computed for between-conditions variability) is significantly greater than
$MS_{res}$ (the value computed for error variability), it is interpreted as indicating that a substantial
portion of between-conditions variability is due to a treatment effect. The rationale for this is as 
follows. 

Experimental error is random variability in the data that is beyond the control of the 
researcher. In a within-subjects design the average amount of variability within the k scores of 
each of the n subjects that cannot be accounted for on the basis of a treatment effect is employed 
to represent experimental error. Thus, the value computed for $MS_{res}$ is the normal amount of
variability that is expected for any subject who serves in each of k experimental conditions, if the
conditions are equivalent to one another. Within this framework, residual variability is employed
as a baseline to represent variability which results from factors that are beyond an experimenter's
control. The experimenter assumes that even if no treatment effect is present, since such
uncontrollable factors are responsible for within-subjects variability, it is logical to assume that they
can produce differences of a comparable magnitude between the means of the k experimental
conditions. As long as the variability between the condition means ($MS_{BC}$) is approximately the
same as residual variability ($MS_{res}$) the experimenter can attribute any between-conditions
variability present to experimental error. When, however, between-conditions variability is
substantially greater than residual variability, it indicates that something over and above error
variability is contributing to the variability between the k condition means. In such a case, it is
assumed that a treatment effect is responsible for the larger value of $MS_{BC}$ relative to the value
of $MS_{res}$. In essence, if residual variability is subtracted from within-subjects variability, any
remaining variability within the scores of subjects can be attributed to a treatment effect. If there
is no treatment effect, the result of the subtraction will be zero. Of course, one can never
completely rule out the possibility that if $MS_{BC}$ is larger than $MS_{res}$, the larger value for $MS_{BC}$
is entirely due to error variability. However, since the latter is unlikely, when $MS_{BC}$ is
significantly larger than $MS_{res}$, it is interpreted as indicating the presence of a treatment effect.


Table 24.13 Alternative Summary Table for Analysis
of Variance for Example 24.1

Source of variation          SS      df      MS       F

Between-subjects           36.93      5     7.39

Within-subjects            45.35     12     3.78
   Between-conditions      40.45      2    20.23    41.29
   Residual                 4.9      10      .49

Total                      82.28     17


In some sources a table employing the format depicted in Table 24.13 is used to summarize 
the results of a single-factor within-subjects analysis of variance. In contrast to Table 24.2, 
which does not include a row documenting within-subjects variability, Table 24.13 includes the 
latter variability, which is partitioned into between-conditions variability and residual variability. 
In point of fact, it is not necessary to compute the information documented in the row for within- 
subjects variability in order to compute the F ratio. 

Note that in Table 24.13 the following relationships will always be true: a) $SS_T = SS_{BS}
+ SS_{WS}$; b) $df_T = df_{BS} + df_{WS}$; c) $SS_{WS} = SS_{BC} + SS_{res}$; and d) $df_{WS} = df_{BC} + df_{res}$. The
values $SS_{WS}$, $df_{WS}$, and $MS_{WS}$ in Table 24.13 are, respectively, computed with Equations
24.30, 24.31, and 24.32.





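$$ SS_{WS} = \Sigma X_T^2 - \sum_{i=1}^{n} \left[ \frac{(\Sigma S_i)^2}{k} \right] \qquad \text{(Equation 24.30)} $$

$$ df_{WS} = n(k - 1) \qquad \text{(Equation 24.31)} $$

$$ MS_{WS} = \frac{SS_{WS}}{df_{WS}} \qquad \text{(Equation 24.32)} $$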


The values $SS_{WS} = 45.35$, $df_{WS} = 12$, and $MS_{WS} = 3.78$ are computed below for Example
24.1.

$$ SS_{WS} = 869 - \left[ \frac{(20)^2}{3} + \frac{(25)^2}{3} + \frac{(15)^2}{3} + \frac{(25)^2}{3} + \frac{(14)^2}{3} + \frac{(20)^2}{3} \right] = 45.35 $$

$$ df_{WS} = 6(3 - 1) = 12 \qquad MS_{WS} = \frac{45.35}{12} = 3.78 $$

Note that $SS_{WS} = SS_{BC} + SS_{res} = 40.45 + 4.9 = 45.35$ and $df_{WS} = df_{BC} + df_{res} = 2
+ 10 = 12$.
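
The within-subjects values can be checked with a minimal Python sketch of Equations 24.30 to 24.32. The inputs are the sum of the squared scores (869) and the six subject totals used in the computation above; the variable names are illustrative.

```python
# Summary values used with Equation 24.30 for Example 24.1
sum_x_squared = 869                      # sum of all N squared scores
subject_sums = [20, 25, 15, 25, 14, 20]  # total of the k scores of each subject
k = 3                                    # number of experimental conditions
n = len(subject_sums)                    # number of subjects

ss_ws = sum_x_squared - sum(s ** 2 / k for s in subject_sums)  # Equation 24.30
df_ws = n * (k - 1)                                            # Equation 24.31
ms_ws = ss_ws / df_ws                                          # Equation 24.32
print(round(ss_ws, 2), df_ws, round(ms_ws, 2))  # 45.33 12 3.78
```

Carried out without intermediate rounding, the computation gives 45.33 rather than the tabled 45.35. The small discrepancy appears to reflect rounding of intermediate values in the hand calculation and does not affect $MS_{WS}$ to two decimal places.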

2. Definitional equations for the single-factor within-subjects analysis of variance In the 
description of the computational protocol for the single-factor within-subjects analysis of 
variance, Equations 24.2–24.5 are employed to compute the values $SS_T$, $SS_{BC}$, $SS_{BS}$, and
$SS_{res}$. The latter set of computational equations was employed, since they allow for the most
efficient computation of the sum of squares values. As noted in Section IV, computational
equations are derived from definitional equations which reveal the underlying logic involved in
the derivation of the sums of squares. This section will describe the definitional equations for
the single-factor within-subjects analysis of variance, and apply them to Example 24.1 in order
to demonstrate that they yield the same values as the computational equations.

As noted previously, the total sum of squares ($SS_T$) is made up of two elements, the
between-subjects sum of squares ($SS_{BS}$) and the within-subjects sum of squares ($SS_{WS}$), and
the latter sum of squares can be partitioned into the between-conditions sum of squares ($SS_{BC}$)
and the residual sum of squares ($SS_{res}$). The contribution of any single subject's score to the total
variability in the data can be expressed in terms of a between-subjects component and a within-
subjects component. When the between-subjects component and the within-subjects component
are added, the sum reflects that subject's total contribution to the overall variability in the data.
The contribution of all N scores to the total variability ($SS_T$) and the elements that comprise
it ($SS_{BS}$ and $SS_{WS}$, and $SS_{BC}$ and $SS_{res}$ which comprise the latter) are summarized in Table
24.14. The definitional equations described in this section employ the following notation:
$X_{ij}$ represents the score of the i-th subject in the j-th condition, $\bar{X}_T$ represents the grand mean
(which is $\bar{X}_T = \Sigma X_T / N = 119/18 = 6.61$), $\bar{X}_j$ represents the mean of the j-th condition,
and $\bar{S}_i$ represents the mean of the k scores of the i-th subject.

Equation 24.33 is the definitional equation for the total sum of squares.

$$ SS_T = \sum_{j=1}^{k} \sum_{i=1}^{n} (X_{ij} - \bar{X}_T)^2 \qquad \text{(Equation 24.33)} $$

In employing Equation 24.33 to compute $SS_T$, the grand mean ($\bar{X}_T$) is subtracted from each
of the N scores and each of the N difference scores is squared. The total sum of squares ($SS_T$)
is the sum of the N squared difference scores. Equation 24.33 is computationally equivalent to
Equation 24.2.

Equation 24.34 is the definitional equation for the between-subjects sum of squares. 


$$ SS_{BS} = k \sum_{i=1}^{n} (\bar{S}_i - \bar{X}_T)^2 \qquad \text{(Equation 24.34)} $$

In employing Equation 24.34 to compute $SS_{BS}$, the following operations are carried out for
each of the n subjects. The grand mean ($\bar{X}_T$) is subtracted from the mean of the subject's k
scores. The difference score is squared and the squared difference score is multiplied by the
number of experimental conditions (k). After this is done for all n subjects, the values that have
been obtained for each subject as a result of multiplying the squared difference score by k are
summed. The resulting value represents the between-subjects sum of squares ($SS_{BS}$). Equation
24.34 is computationally equivalent to Equation 24.4.

Equation 24.35 is the definitional equation for the within-subjects sum of squares. 


$$SS_{WS} = \sum_{i=1}^{n} \sum_{j=1}^{k} (X_{ij} - \bar{S}_i)^2$$    (Equation 24.35) 


In employing Equation 24.35 to compute $SS_{WS}$, the following operations are carried out for 
the k scores of each of the n subjects. The mean of a subject's k scores ($\bar{S}_i$) is subtracted from 
each of the subject's scores, and the k difference scores for that subject are squared. The sum 
of the k squared difference scores for all n subjects (i.e., the sum total of N squared difference 
scores) represents the within-subjects sum of squares ($SS_{WS}$). Equation 24.35 is computationally 
equivalent to Equation 24.30. 

Equation 24.36 is the definitional equation for the between-conditions sum of squares. 


$$SS_{BC} = n \sum_{j=1}^{k} (\bar{X}_j - \bar{X}_T)^2$$    (Equation 24.36) 


In employing Equation 24.36 to compute $SS_{BC}$, the following operations are carried 
out for each experimental condition. The grand mean ($\bar{X}_T$) is subtracted from the condition 
mean ($\bar{X}_j$). The difference score is squared, and the squared difference score is multiplied by the 
number of scores in that condition (n). After this is done for all k conditions, the values that have 
been obtained for each condition as a result of multiplying the squared difference score by the 
number of subjects in the condition are summed. The resulting value represents the 
between-conditions sum of squares ($SS_{BC}$). Equation 24.36 is computationally equivalent to Equation 
24.3. An alternative but equivalent method of obtaining $SS_{BC}$ (which is employed in deriving $SS_{BC}$ 
in Table 24.14) is as follows: Within each condition, for each of the n subjects the grand mean 
is subtracted from the condition mean, each difference score is squared, and upon doing this for 
all k conditions, the N squared difference scores are summed. 

Equation 24.37 is the definitional equation for the residual sum of squares. 





$$SS_{res} = \sum_{j=1}^{k} \sum_{i=1}^{n} \left[ (X_{ij} - \bar{X}_T) - (\bar{S}_i - \bar{X}_T) - (\bar{X}_j - \bar{X}_T) \right]^2$$    (Equation 24.37) 


In employing Equation 24.37 to compute $SS_{res}$, the following operations are carried out for 
each of the N scores: a) The grand mean ($\bar{X}_T$) is subtracted from the score ($X_{ij}$); b) The grand 
mean ($\bar{X}_T$) is subtracted from the mean of the k scores for that subject ($\bar{S}_i$); and c) The grand 
mean ($\bar{X}_T$) is subtracted from the mean of the condition from which the score is derived ($\bar{X}_j$). 
The value of the difference score obtained in b) is subtracted from the value of the difference 
score obtained in a), and the difference score obtained in c) is subtracted from the resulting 
difference. The resulting value is squared, and the sum of the squared values for all N scores 
represents the residual sum of squares ($SS_{res}$). Note that in Equation 24.37, for each subject a 
between-subjects and between-conditions component of variability is subtracted from the 
subject's contribution to the total variability, resulting in the subject's contribution to the residual 
variability. Equation 24.37 is computationally equivalent to Equations 24.5/24.6. 

Table 24.14 illustrates the use of Equations 24.33-24.37 with the data for Example 24.1. 
In the computations summarized in Table 24.14, the following $\bar{S}_i$ values are employed: 
$\bar{S}_1 = 20/3 = 6.67$, $\bar{S}_2 = 25/3 = 8.33$, $\bar{S}_3 = 15/3 = 5$, $\bar{S}_4 = 25/3 = 8.33$, 
$\bar{S}_5 = 14/3 = 4.67$, $\bar{S}_6 = 20/3 = 6.67$. The resulting values of $SS_T$, $SS_{BS}$, $SS_{WS}$, 
$SS_{BC}$, and $SS_{res}$ are identical to those obtained 
with the computational equations (Equations 24.2, 24.4, 24.30, 24.3, and 24.5/24.6). Any 
minimal discrepancies are the result of rounding off error. 
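
The definitional equations can also be checked numerically. The following is a minimal Python sketch, 
not part of the original text: the function name `sums_of_squares` and the dictionary keys are 
illustrative choices, and the scores are those listed for Example 24.1. It applies Equations 
24.33-24.37 directly to the unrounded means. 

```python
# Scores from Example 24.1: rows are subjects, columns are Conditions 1-3.
scores = [
    [9, 7, 4],
    [10, 8, 7],
    [7, 5, 3],
    [10, 8, 7],
    [7, 5, 2],
    [8, 6, 6],
]


def sums_of_squares(data):
    """Apply the definitional equations (24.33-24.37) to an n x k score table."""
    n = len(data)        # number of subjects
    k = len(data[0])     # number of conditions
    grand_mean = sum(sum(row) for row in data) / (n * k)
    subject_means = [sum(row) / k for row in data]
    condition_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Equation 24.33: total sum of squares
    ss_t = sum((x - grand_mean) ** 2 for row in data for x in row)
    # Equation 24.34: between-subjects sum of squares
    ss_bs = k * sum((s - grand_mean) ** 2 for s in subject_means)
    # Equation 24.35: within-subjects sum of squares
    ss_ws = sum(
        (x - subject_means[i]) ** 2 for i, row in enumerate(data) for x in row
    )
    # Equation 24.36: between-conditions sum of squares
    ss_bc = n * sum((m - grand_mean) ** 2 for m in condition_means)
    # Equation 24.37: residual sum of squares
    ss_res = sum(
        (
            (x - grand_mean)
            - (subject_means[i] - grand_mean)
            - (condition_means[j] - grand_mean)
        ) ** 2
        for i, row in enumerate(data)
        for j, x in enumerate(row)
    )
    return {"T": ss_t, "BS": ss_bs, "WS": ss_ws, "BC": ss_bc, "res": ss_res}


ss = sums_of_squares(scores)
for name, value in ss.items():
    print(f"SS_{name} = {value:.2f}")
```

Because the sketch carries the means at full precision, its results (approximately 82.28, 36.94, 45.33, 
40.44, and 4.89) agree with the computational equations and differ slightly from the hand-rounded 
column totals in Table 24.14. 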


Table 24.14 Computation of Sums of Squares for Example 24.1 with Definitional Equations 

Part 1: $SS_T = \sum_{j=1}^{k}\sum_{i=1}^{n}(X_{ij} - \bar{X}_T)^2$, $SS_{BC} = n\sum_{j=1}^{k}(\bar{X}_j - \bar{X}_T)^2$, $SS_{WS} = \sum_{i=1}^{n}\sum_{j=1}^{k}(X_{ij} - \bar{S}_i)^2$ 

(Subject,
Condition)   X_ij   SS_T term                 SS_BC term              SS_WS term

Condition 1
(1,1)          9    (9.00-6.61)^2  =  5.71    (8.50-6.61)^2 = 3.57    (9.00-6.67)^2  = 5.43
(2,1)         10    (10.00-6.61)^2 = 11.49    (8.50-6.61)^2 = 3.57    (10.00-8.33)^2 = 2.79
(3,1)          7    (7.00-6.61)^2  =   .15    (8.50-6.61)^2 = 3.57    (7.00-5.00)^2  = 4.00
(4,1)         10    (10.00-6.61)^2 = 11.49    (8.50-6.61)^2 = 3.57    (10.00-8.33)^2 = 2.79
(5,1)          7    (7.00-6.61)^2  =   .15    (8.50-6.61)^2 = 3.57    (7.00-4.67)^2  = 5.43
(6,1)          8    (8.00-6.61)^2  =  1.93    (8.50-6.61)^2 = 3.57    (8.00-6.67)^2  = 1.77

Condition 2
(1,2)          7    (7.00-6.61)^2  =   .15    (6.50-6.61)^2 =  .01    (7.00-6.67)^2  =  .11
(2,2)          8    (8.00-6.61)^2  =  1.93    (6.50-6.61)^2 =  .01    (8.00-8.33)^2  =  .11
(3,2)          5    (5.00-6.61)^2  =  2.59    (6.50-6.61)^2 =  .01    (5.00-5.00)^2  =  .00
(4,2)          8    (8.00-6.61)^2  =  1.93    (6.50-6.61)^2 =  .01    (8.00-8.33)^2  =  .11
(5,2)          5    (5.00-6.61)^2  =  2.59    (6.50-6.61)^2 =  .01    (5.00-4.67)^2  =  .11
(6,2)          6    (6.00-6.61)^2  =   .37    (6.50-6.61)^2 =  .01    (6.00-6.67)^2  =  .45

Condition 3
(1,3)          4    (4.00-6.61)^2  =  6.81    (4.83-6.61)^2 = 3.17    (4.00-6.67)^2  = 7.13
(2,3)          7    (7.00-6.61)^2  =   .15    (4.83-6.61)^2 = 3.17    (7.00-8.33)^2  = 1.77
(3,3)          3    (3.00-6.61)^2  = 13.03    (4.83-6.61)^2 = 3.17    (3.00-5.00)^2  = 4.00
(4,3)          7    (7.00-6.61)^2  =   .15    (4.83-6.61)^2 = 3.17    (7.00-8.33)^2  = 1.77
(5,3)          2    (2.00-6.61)^2  = 21.25    (4.83-6.61)^2 = 3.17    (2.00-4.67)^2  = 7.13
(6,3)          6    (6.00-6.61)^2  =   .37    (4.83-6.61)^2 = 3.17    (6.00-6.67)^2  =  .45

                    SS_T = 82.24              SS_BC = 40.50           SS_WS = 45.35

Part 2: $SS_{BS} = k\sum_{i=1}^{n}(\bar{S}_i - \bar{X}_T)^2$, $SS_{res} = \sum_{j=1}^{k}\sum_{i=1}^{n}[(X_{ij} - \bar{X}_T) - (\bar{S}_i - \bar{X}_T) - (\bar{X}_j - \bar{X}_T)]^2$ 

(Subject,
Condition)   X_ij   SS_BS term                SS_res term

Condition 1
(1,1)          9    (6.67-6.61)^2 =  .00      [(9.00-6.61)-(6.67-6.61)-(8.50-6.61)]^2  =  .19
(2,1)         10    (8.33-6.61)^2 = 2.96      [(10.00-6.61)-(8.33-6.61)-(8.50-6.61)]^2 =  .05
(3,1)          7    (5.00-6.61)^2 = 2.59      [(7.00-6.61)-(5.00-6.61)-(8.50-6.61)]^2  =  .01
(4,1)         10    (8.33-6.61)^2 = 2.96      [(10.00-6.61)-(8.33-6.61)-(8.50-6.61)]^2 =  .05
(5,1)          7    (4.67-6.61)^2 = 3.76      [(7.00-6.61)-(4.67-6.61)-(8.50-6.61)]^2  =  .19
(6,1)          8    (6.67-6.61)^2 =  .00      [(8.00-6.61)-(6.67-6.61)-(8.50-6.61)]^2  =  .31

Condition 2
(1,2)          7    (6.67-6.61)^2 =  .00      [(7.00-6.61)-(6.67-6.61)-(6.50-6.61)]^2  =  .19
(2,2)          8    (8.33-6.61)^2 = 2.96      [(8.00-6.61)-(8.33-6.61)-(6.50-6.61)]^2  =  .05
(3,2)          5    (5.00-6.61)^2 = 2.59      [(5.00-6.61)-(5.00-6.61)-(6.50-6.61)]^2  =  .01
(4,2)          8    (8.33-6.61)^2 = 2.96      [(8.00-6.61)-(8.33-6.61)-(6.50-6.61)]^2  =  .05
(5,2)          5    (4.67-6.61)^2 = 3.76      [(5.00-6.61)-(4.67-6.61)-(6.50-6.61)]^2  =  .19
(6,2)          6    (6.67-6.61)^2 =  .00      [(6.00-6.61)-(6.67-6.61)-(6.50-6.61)]^2  =  .31

Condition 3
(1,3)          4    (6.67-6.61)^2 =  .00      [(4.00-6.61)-(6.67-6.61)-(4.83-6.61)]^2  =  .79
(2,3)          7    (8.33-6.61)^2 = 2.96      [(7.00-6.61)-(8.33-6.61)-(4.83-6.61)]^2  =  .20
(3,3)          3    (5.00-6.61)^2 = 2.59      [(3.00-6.61)-(5.00-6.61)-(4.83-6.61)]^2  =  .05
(4,3)          7    (8.33-6.61)^2 = 2.96      [(7.00-6.61)-(8.33-6.61)-(4.83-6.61)]^2  =  .20
(5,3)          2    (4.67-6.61)^2 = 3.76      [(2.00-6.61)-(4.67-6.61)-(4.83-6.61)]^2  =  .79
(6,3)          6    (6.67-6.61)^2 =  .00      [(6.00-6.61)-(6.67-6.61)-(4.83-6.61)]^2  = 1.23

                    SS_BS = 36.81             SS_res = 4.86


Table 24.15 Summary Table of Single-Factor 
Between-Subjects Analysis of Variance for Example 24.1 


Source of variation      SS      df      MS       F
Between-groups         40.45      2    20.23    7.25
Within-groups          41.83     15     2.79

Total                  82.28     17


3. Relative power of the single-factor within-subjects analysis of variance and the 
single-factor between-subjects analysis of variance The use of $MS_{res}$ as the measure of error 
variability (as opposed to $MS_{WC}$) for the single-factor within-subjects analysis of variance 
provides for an optimally powerful test of an alternative hypothesis. $MS_{res}$ 
allows for a more powerful test of an alternative hypothesis than $MS_{WC}$ because, when no 
treatment effect is present in the data, it is expected that the average variability of the k scores 
of n subjects will be less than the average variability of the scores of n different subjects who 
serve in any single experimental condition (in an experiment involving k experimental 
conditions). 

To illustrate this point, let us assume that the data for Example 24.1 are obtained in an 
experiment employing an independent groups/between-subjects design, and as a result of the 
latter $MS_{WC} = MS_{WG}$ is employed as the measure of error variability. Thus, we will assume that 
each of N = 18 subjects is randomly assigned to one of k = 3 experimental conditions, resulting 
in n = 6 scores per condition. The data for such an experiment will be evaluated with a 
single-factor between-subjects analysis of variance. In conducting the computations for the latter 
analysis, the value of $SS_T$ is computed with Equation 21.2 (which, in fact, is identical to Equation 
24.2, which is employed to compute $SS_T$ when Example 24.1 is evaluated with a single-factor 
within-subjects analysis of variance). Thus, $SS_T = 82.28$. Equation 21.3, which is employed 
to compute the between-groups sum of squares ($SS_{BG}$), is, in fact, identical to Equation 24.3 
(which is employed in Section IV to compute the between-conditions sum of squares ($SS_{BC}$)). 
Thus, $SS_{BG} = SS_{BC} = 40.45$. The within-groups sum of squares ($SS_{WG}$) can be computed with 
Equation 21.4. Thus, $SS_{WG} = SS_T - SS_{BG} = 82.28 - 40.45 = 41.83$. Note that the latter 
value is identical to the value computed with Equation 24.29 (which as noted earlier is 
computationally equivalent to Equation 21.5, which yields the same value as Equation 21.4). 
Employing the values k = 3 and N = 18 in Equations 21.8-21.10, the values $df_{BG} = 2$, 
$df_{WG} = 15$, and $df_T = 17$ are computed. Substituting the appropriate degrees of freedom in 
Equations 21.6 and 21.7, the values $MS_{BG} = 40.45/2 = 20.23$ and $MS_{WG} = 41.83/15 = 2.79$ 
are computed. Using Equation 21.12, F = 20.23/2.79 = 7.25. Table 24.15 is the summary table 
of the analysis of variance. 

Since $df_{num} = df_{BG} = 2$ and $df_{den} = df_{WG} = 15$, $F_{.05} = 3.68$ and $F_{.01} = 6.36$ are the 
critical values in Table A10 that are employed to evaluate the nondirectional alternative 
hypothesis. Since the obtained value F = 7.25 is greater than both of the aforementioned critical 
values, the alternative hypothesis is supported at both the .05 and .01 levels. Note, however, that 
the value F = 7.25 is substantially less than the value F = 41.29, which is obtained when the same 
set of data is evaluated with the single-factor within-subjects analysis of variance. Although 
the value F = 7.25 obtained for a between-subjects analysis is significant at both the .05 and .01 
levels, F = 7.25 is not very far removed from the tabled critical value $F_{.01} = 6.36$. The value 
F = 41.29, on the other hand, is well above the tabled critical value $F_{.01} = 7.56$ (which is the 
tabled critical .01 value employed for the single-factor within-subjects analysis of variance 
for $df_{BC} = 2$ and $df_{res} = 10$). The fact that the difference between the computed F value and 
the tabled critical $F_{.01}$ value is much larger when the single-factor within-subjects analysis of 
variance is employed illustrates that a within-subjects analysis provides a more powerful test 
of an alternative hypothesis than a between-subjects analysis. 
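
The contrast between the two error terms can be reproduced with a short Python sketch. It is an 
illustration only: the function name `f_ratios` is hypothetical, the scores are those of Example 24.1, 
and the sums of squares are obtained with the usual computational formulas rather than copied from 
the summary tables. 

```python
# Scores from Example 24.1: rows are subjects, columns are Conditions 1-3.
scores = [[9, 7, 4], [10, 8, 7], [7, 5, 3], [10, 8, 7], [7, 5, 2], [8, 6, 6]]


def f_ratios(data):
    """Return (F treating the data as between-subjects, F treating it as within-subjects)."""
    n, k = len(data), len(data[0])
    big_n = n * k
    correction = sum(sum(row) for row in data) ** 2 / big_n

    ss_t = sum(x * x for row in data for x in row) - correction
    ss_bc = sum(sum(row[j] for row in data) ** 2 for j in range(k)) / n - correction
    ss_bs = sum(sum(row) ** 2 for row in data) / k - correction

    ss_wg = ss_t - ss_bc             # error term of the between-subjects analysis
    ss_res = ss_t - ss_bc - ss_bs    # error term of the within-subjects analysis

    ms_bc = ss_bc / (k - 1)
    f_between = ms_bc / (ss_wg / (big_n - k))
    f_within = ms_bc / (ss_res / ((n - 1) * (k - 1)))
    return f_between, f_within


f_between, f_within = f_ratios(scores)
print(round(f_between, 2), round(f_within, 2))
```

Carried at full precision, the within-subjects ratio comes out near 41.36 rather than 41.29; the 
latter value results when the mean squares are first rounded to two decimal places (20.23/.49). 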

It should be noted that for the same set of data, the tabled critical F value at a given level 
of significance for a single-factor between-subjects analysis of variance will always be lower 
than the tabled critical F value for a single-factor within-subjects analysis of variance (unless 
there is an extremely large number of scores in each condition, in which case the tabled critical 
F values for both analyses will be equivalent). This is the case since (as long as n is not 
extremely large) the number of degrees of freedom associated with the denominator of the F ratio 
will always be larger for a single-factor between-subjects analysis of variance (assuming the 
values of $n = n_j$ and k for both analyses are equal) than for a single-factor within-subjects 
analysis of variance (i.e., $df_{WG} > df_{res}$). It is important to note, however, that any loss of 
degrees of freedom associated with a within-subjects analysis will more than likely be offset as 
a result of employing $MS_{res}$ as the error term in the computation of the F ratio. A final point that 
should be made is that, if in a within-subjects design, subjects' scores in the k experimental 
conditions are not correlated with one another (which is highly unlikely), a single-factor 
within-subjects analysis of variance and a single-factor between-subjects analysis of variance (as 
well as a t test for two dependent samples and a t test for two independent samples when k 
= 2) will yield comparable results. 
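
The difference in error degrees of freedom can be made concrete with a small Python sketch. The 
function name `error_df` is an illustrative choice; it assumes equal n per condition, as in the 
passage above, and uses $df_{WG} = N - k = k(n - 1)$ and $df_{res} = (n - 1)(k - 1)$. 

```python
def error_df(n, k):
    """Return (df_WG, df_res) for n scores per condition and k conditions."""
    df_wg = k * (n - 1)           # between-subjects error df, equal to N - k
    df_res = (n - 1) * (k - 1)    # within-subjects residual df
    return df_wg, df_res


# Example 24.1: n = 6 subjects, k = 3 conditions.
df_wg, df_res = error_df(6, 3)
print(df_wg, df_res)

# The between-subjects analysis always has n - 1 more error degrees of freedom.
print(df_wg - df_res)
```

Since $k(n - 1) - (n - 1)(k - 1) = n - 1$, the within-subjects analysis gives up exactly n - 1 error 
degrees of freedom, which are the degrees of freedom assigned to the between-subjects source. 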


4. Equivalency of the single-factor within-subjects analysis of variance and the t test for 
two dependent samples when k = 2 Interval/ratio data for an experiment involving k = 2 
dependent samples can be evaluated with either a single-factor within-subjects analysis of 
variance or a t test for two dependent samples. When both of the aforementioned tests are 
employed to evaluate the same set of data they will yield the same result. Specifically, the 
following will always be true with respect to the relationship between the computed F and t 
values for the same set of data: $F = t^2$ and $t = \sqrt{F}$. It will also be the case that the square of 
the tabled critical t value at a prespecified level of significance for df = n - 1 will be equal to the 
tabled critical F value at the same level of significance for $df_{BC} = 1$ and $df_{res}$ (which will be 
$df_{res} = (n - 1)(k - 1) = (n - 1)(2 - 1) = n - 1$, which is equivalent to the value df = n - 1 
employed for the t test for two dependent samples). 
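
The identity $F = t^2$ can be verified numerically with a brief Python sketch. The data below are an 
assumption made for illustration: they are the Condition 2 and Condition 3 scores of Example 24.1 
treated as a k = 2 design, not the data of Example 17.1, and the function names are hypothetical. 

```python
from math import sqrt

# Illustrative k = 2 data: Conditions 2 and 3 of Example 24.1 (six subjects).
cond_a = [7, 8, 5, 8, 5, 6]
cond_b = [4, 7, 3, 7, 2, 6]


def paired_t(x, y):
    """t test for two dependent samples, computed from the difference scores."""
    n = len(x)
    d = [a - b for a, b in zip(x, y)]
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)
    return mean_d / sqrt(var_d / n)


def within_subjects_f(x, y):
    """Single-factor within-subjects F for k = 2 via the definitional equations."""
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    cond_means = [sum(x) / n, sum(y) / n]
    subj_means = [(a + b) / k for a, b in zip(x, y)]
    ss_bc = n * sum((m - grand) ** 2 for m in cond_means)
    # (X - grand) - (S - grand) - (Xj - grand) simplifies to X - S - Xj + grand
    ss_res = sum(
        (score - s - m + grand) ** 2
        for a, b, s in zip(x, y, subj_means)
        for score, m in ((a, cond_means[0]), (b, cond_means[1]))
    )
    return (ss_bc / (k - 1)) / (ss_res / ((n - 1) * (k - 1)))


t_value = paired_t(cond_a, cond_b)
f_value = within_subjects_f(cond_a, cond_b)
print(round(t_value ** 2, 4), round(f_value, 4))
```

The two printed values coincide, which is the relationship stated above; any other set of paired 
scores with nonzero variability in the difference scores would serve equally well. 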

To illustrate the equivalency of the results obtained with the single-factor within-subjects 
analysis of variance and the t test for two dependent samples when k = 2, an F value will be 
computed for Example 17.1. The value t = 2.86 is obtained (a more precise value t = 2.848 is 
obtained if all computations are carried out to 3 decimal places) for the latter example when the 
t test for two dependent samples is employed. When the same set of data is evaluated with the 
single-factor within-subjects analysis of variance, the value F = 8.11 is computed. Note 
that $(t = 2.848)^2 = (F = 8.11)$. Equations 24.2-24.5 are employed below to compute the 
values $SS_T$, $SS_{BC}$, $SS_{BS}$, and $SS_{res}$ for Example 17.1. Since k = 2, n = 10, and nk = N = 20, 
$df_{BC} = 2 - 1 = 1$, $df_{BS} = 10 - 1 = 9$, $df_{res} = (10 - 1)(2 - 1) = 9$, and 
$df_T = 20 - 1 = 19$. The full analysis of variance is summarized in Table 24.16. 





$$SS_T = 440 - \frac{(78)^2}{20} = 135.8 \qquad SS_{BC} = \left[ \frac{(47)^2}{10} + \frac{(31)^2}{10} \right] - \frac{(78)^2}{20} = 12.8$$

$$SS_{BS} = \left[ \frac{(\Sigma S_1)^2}{2} + \cdots + \frac{(\Sigma S_{10})^2}{2} \right] - \frac{(78)^2}{20} = 108.8$$

$$SS_{res} = 135.8 - 12.8 - 108.8 = 14.2$$


Table 24.16 Summary Table of Analysis of Variance 
for Example 17.1 


Source of variation       SS       df      MS       F
Between-subjects        108.8       9
Between-conditions       12.8       1    12.80    8.11
Residual                 14.2       9     1.58

Total                   135.8      19

For $df_{BC} = 1$ and $df_{res} = 9$, the tabled critical .05 and .01 values are $F_{.05} = 5.12$ and 
$F_{.01} = 10.56$ (which are appropriate for a nondirectional analysis). Note that (if one takes into 
account rounding off error) the square roots of the aforementioned tabled critical values are (for 
df = 9) the tabled critical two-tailed values $t_{.05} = 2.26$ and $t_{.01} = 3.25$ that are employed in 
Example 17.1 to evaluate the value t = 2.86. Since the obtained value F = 8.11 is greater than $F_{.05} = 5.12$ 
but less than $F_{.01} = 10.56$, the nondirectional alternative hypothesis $H_1\colon \mu_1 \neq \mu_2$ is supported, 
but only at the .05 level. The directional alternative hypothesis $H_1\colon \mu_1 > \mu_2$ is supported at 
both the .05 and .01 levels, since F = 8.11 is greater than the tabled critical one-tailed .05 and .01 
values $F_{.05} = 3.36$ and $F_{.01} = 7.95$ (the square roots of which are the tabled critical one-tailed 
.05 and .01 values $t_{.05} = 1.83$ and $t_{.01} = 2.82$ employed for Example 17.1). The conclusions 
derived from the single-factor within-subjects analysis of variance are identical to those 
reached when the data are evaluated with the t test for two dependent samples. 
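
The arithmetic linking the F and t values for Example 17.1 can be retraced in a few lines of Python. 
This is only a sketch: it starts from the sums of squares and degrees of freedom reported in Table 
24.16 and from the tabled two-tailed critical F values quoted above, and the variable names are 
illustrative. 

```python
from math import sqrt

# Values reported in Table 24.16 for Example 17.1.
ss_bc, df_bc = 12.8, 1
ss_res, df_res = 14.2, 9

ms_bc = ss_bc / df_bc
ms_res = ss_res / df_res
f_value = ms_bc / ms_res

print(round(f_value, 2))        # F ratio
print(round(sqrt(f_value), 3))  # corresponding t value

# Square roots of the tabled nondirectional critical F values quoted in the text.
for critical_f in (5.12, 10.56):
    print(round(sqrt(critical_f), 2))
```

The square root of the F ratio reproduces the more precise t value cited for Example 17.1, and the 
square roots of the critical F values reproduce the two-tailed critical t values for df = 9. 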


5. The Latin square design Prior to reading this section the reader is advised to review the 
discussion of counterbalancing in Section VII of the t test for two dependent samples. In the 
latter section, the subject of counterbalancing is discussed within the context of controlling for 
order effects. In a within-subjects design one method of controlling for the potential influence 
of practice or fatigue (both of which represent examples of order effects) is by employing what 
is referred to as a Latin square design. The latter type of design (which provides incomplete 
counterbalancing) is more likely to be considered as a reasonable option for controlling for 
order effects when the independent variable is comprised of many levels (and consequently it 
becomes prohibitive to employ complete counterbalancing). If we conceptualize a within- 
subjects design as being comprised of n rows (corresponding to each of the n subjects) and k 
columns (corresponding to each of the k treatments/conditions), we can define a Latin square 
design as one in which each treatment appears only one time in each row and only one time in 
each column. Thus, in Table 24.17 there are n = 4 subjects and k = 4 treatments, and since n = 
k the configuration of the design is a 4 x 4 square. The four treatments are identified by the 
letters A, B, C, and D. Subject 1 receives the treatments in the order A, B, C, D, Subject 2 
receives the treatments in the order C, A, D, B, and so on. As noted above, this design does not 
employ complete counterbalancing since, with k = 4 treatments, there are k! = 4! = 24 possible 
presentation orders for the treatments. Thus, a minimum of 24 subjects will be required in order 
to have complete counterbalancing. 


Table 24.17 Latin Square Design 


Subject      Treatment

   1         A   B   C   D
   2         C   A   D   B
   3         ...
   4         ...


The purpose of the Latin square arrangement is to equally distribute any order effects that 
are present over the k treatments. A Latin square design, however, will only provide effective 
control for order effects if there is no interaction between order of presentation and the 
treatments. The concept of interaction is discussed in detail in Section V of the 
between-subjects factorial analysis of variance (Test 27). However, within this context, an interaction 
between order of presentation and treatment will be present if the performance of subjects on a 
given treatment is not only a function of the treatment itself, but also depends on which 
treatments precede it. In point of fact, the absence of an interaction between order of 
presentation and treatments is critical in any within-subjects design, since if an interaction is 
present it will not be possible to obtain a pure measure for any treatment effects that may be 
present. 
Latin square designs can also be utilized for a number of other purposes. One in particular 
is to control for the influence of what are considered to be potentially confounding variables 
(also referred to as extraneous or nuisance variables). The interested reader can find detailed 
discussions of the various applications of Latin square designs in Keppel (1991), Kirk (1982, 
1995), Maxwell and Delaney (1990), Myers and Well (1995), and Winer et al. (1991). 


VIII. Additional Examples Illustrating the Use of the Single-Factor 
Within-Subjects Analysis of Variance 


Since the single-factor within-subjects analysis of variance can be employed to evaluate 
interval/ratio data for any dependent samples design involving two or more experimental conditions, 
it can be used to evaluate any of the examples that are evaluated with the t test for two 
dependent samples (with the exception of Example 17.2). Examples 24.2-24.6 are, 
respectively, extensions of Examples 17.1, 17.3, 17.5, 17.7, and 17.6. As is the case with 
Examples 17.3 and 17.5, Examples 24.3 and 24.4 employ matched subjects, and are thus 
evaluated as a within-subjects design. Examples 24.6 and 24.7 represent extensions of the 
before-after design to a design involving k = 3 experimental conditions. Since the data for all 
of the examples are identical to the data employed in Example 24.1, they yield the same result. 


Example 24.2. A psychologist conducts a study to determine whether or not people exhibit 
more emotionality when they are exposed to sexually explicit words, aggressively toned words, 
or neutral words. Each of six subjects is shown a list of 15 randomly arranged words, which are 
projected on a screen one at a time for a period of five seconds. Five of the words on the list are 
sexually explicit, five of the words are aggressively toned, and five of the words are neutral. As 
each word is projected on the screen, a subject is instructed to say the word softly to him or 
herself. As a subject does this, sensors attached to the palms of the subject's hands record 
galvanic skin response (GSR), which is used by the psychologist as a measure of emotionality. 
The psychologist computes the following three scores for each subject, one score for each of the 
three experimental conditions: Condition 1: GSR/Sexually explicit — The average GSR score 
for the five sexually explicit words; Condition 2: GSR/Aggressively toned — The average GSR 
score for the five aggressively toned words; Condition 3: GSR/Neutral — The average GSR 
score for the five neutral words. The GSR/Sexually explicit, GSR/Aggressively toned, and GSR/ 
Neutral scores of the six subjects follow. (The higher the score, the higher the level of 
emotionality.) Subject 1 (9, 7, 4); Subject 2 (10, 8, 7); Subject 3 (7, 5, 3); Subject 4 (10, 8, 7); 
Subject 5 (7, 5, 2); Subject 6 (8, 6, 6). Do subjects exhibit differences in emotionality with 
respect to the three categories of words? 


Example 24.3 A psychologist conducts a study in order to determine whether people exhibit 
more emotionality when they are exposed to sexually explicit words, aggressively toned words, 
or neutral words. Six sets of identical triplets are employed as subjects and within each set of 
triplets one member of the set is treated as follows: a) One of the triplets is randomly assigned 
to Condition 1, in which the subject is shown a list of five sexually explicit words; b) One of the 
triplets is randomly assigned to Condition 2, in which the subject is shown a list of five aggres- 
sively toned words; and c) One of the triplets is randomly assigned to Condition 3, in which the 
subject is shown a list of five neutral words. As each word is projected on the screen, a subject 
is instructed to say the word softly to him or herself. As a subject does this, sensors attached to 
the palms of the subject's hands record galvanic skin response (GSR), which is used by the 
psychologist as a measure of emotionality. The psychologist computes the following three scores 
for each set of triplets to represent the emotionality score for each of the experimental 
conditions: Condition 1: GSR/Sexually explicit — The average GSR score for the subject 
presented with the five sexually explicit words; Condition 2: GSR/Aggressively toned — The 
average GSR score for the subject presented with the five aggressively toned words; Condition 
3: GSR/Neutral — The average GSR score for the subject presented with the five neutral words. 
The GSR/Sexually explicit, GSR/Aggressively toned, and GSR/Neutral scores of the six sets of 
triplets follow. (The first score for each triplet set represents the score of the subject presented 
with the sexually explicit words, the second score represents the score of the subject presented 
with the aggressively toned words, and the third score represents the score of the subject 
presented with the neutral words. The higher the score, the higher the level of emotionality.) 
Triplet set 1 (9, 7, 4); Triplet set 2 (10, 8, 7); Triplet set 3 (7, 5, 3); Triplet set 4 (10, 8, 7); 
Triplet set 5 (7, 5, 2); Triplet set 6 (8, 6, 6). Do subjects exhibit differences in emotionality with 
respect to the three categories of words? 


Example 24.4 A researcher wants to assess the impact of different types of punishment on the 
emotionality of mice. Six sets of mice derived from six separate litters are employed as subjects. 
Within each set, one of the litter mates is randomly assigned to one of the three experimental 
conditions. During the course of the experiment each mouse is sequestered in an experimental 
chamber. While in the chamber, each of the six mice in Condition 1 is periodically presented 
with a loud noise, and each of the six mice in Condition 2 is periodically presented with a blast 
of cold air. The six mice in Condition 3 (which is a no treatment control condition) are not 
exposed to any punishment. The presentation of the punitive stimulus for the animals in Con- 
ditions 1 and 2 is generated by a machine that randomly presents the stimulus throughout the 
duration of the time an animal is in the chamber. The dependent variable of emotionality 
employed in the study is the number of times each mouse defecates while in the experimental 
chamber. The number of episodes of defecation for the six sets of mice follows. (The higher the 
score, the higher the level of emotionality.) Litter 1 (9, 7, 4); Litter 2 (10, 8, 7); Litter 3 (7, 5, 
3); Litter 4 (10, 8, 7); Litter 5 (7, 5, 2); Litter 6 (8, 6, 6). Do subjects exhibit differences in 
emotionality under the different experimental conditions? 


Example 24.5 A study is conducted to evaluate the relative efficacy of two drugs (Clearoxin 
and Lesionoxin) and a placebo on chronic psoriasis. Six subjects afflicted with chronic psoriasis 
participate in the study. Each subject is exposed to both drugs and the placebo for a six-month 
period, with a three-month hiatus between treatments. Within the six subjects, the order of 
presentation of the experimental treatments is completely counterbalanced. The dependent 
variable employed in the study is a rating of the severity of a subject's lesions under the three 
experimental conditions. The lower the rating the more severe a subject's psoriasis. The scores 
of the six subjects under the three treatment conditions follow. (The first score represents the 
Clearoxin condition (which represents Condition 1), the second score the Lesionoxin condition 
(which represents Condition 2), and the third score the placebo condition (which represents 
Condition 3).) Subject 1 (9, 7, 4); Subject 2 (10, 8, 7); Subject 3 (7, 5, 3); Subject 4 (10, 8, 7); 
Subject 5 (7, 5, 2); Subject 6 (8, 6, 6). Do the data indicate differences in subjects’ responses 
under the three experimental conditions? 


Example 24.6 In order to assess the efficacy of electroconvulsive therapy (ECT), a psychiatrist 
evaluates six clinically depressed patients who receive a series of ECT treatments. Each patient 
is evaluated at the following three points in time: a) One day prior to the first treatment in the 
ECT series; b) The day following the final treatment in the ECT series; and c) Six months after 
the final treatment in the ECT series. During each evaluation period a standardized interview 
is used to operationalize a patient's level of depression, and on the basis of the interview a 
patient is assigned a score ranging from 0 to 10. The higher a patient's score, the more de- 
pressed the patient. The depression scores of the six patients during each of the three time 
periods follow: Patient 1 (9, 7, 4); Patient 2 (10, 8, 7); Patient 3 (7, 5, 3); Patient 4 (10, 8, 7); 
Patient 5 (7, 5, 2); Patient 6 (8, 6, 6). Do the data indicate that the ECT is effective, and, if so, 
is the effect maintained six months after the treatment? 


Although, as described, Example 24.6 can be evaluated with a single-factor within- 
subjects analysis of variance, the design of the study does not allow one to rule out the potential 
impact of confounding variables. To be more specific, Example 24.6 (which represents an 
extension of a before-after design to more than two measurement periods) does not allow a 
researcher to draw definitive conclusions with respect to whether any observed changes in mood 
are, in fact, due to the ECT treatments. Thus, even if there is a significant decrease in subjects’ 
depression scores following the final ECT treatment, and the effect is still present six months 
later, factors other than ECT can account for such a result. As an example, all of the patients may 
have been depressed about a problem related to the economy, and if, in fact, during the course 
of the study the economy improves dramatically, the observed changes in mood can be attributed 
to the improved economy rather than the ECT. In order for the design of the above study to be 
suitable, it is necessary to include a control group — specifically, a comparable group of 
depressed patients who are not given ECT (or are given “sham” ECT treatments). By contrast- 
ing the depression scores of the control group with those of the treatment group, one can 
determine whether any observed differences across the three time periods are in fact attributable 
to the ECT. Inclusion of such a control group would require that the design of the above study 
be modified into a mixed factorial design. The latter design and the analysis of variance em- 
ployed to evaluate it are discussed in Section IX of the between-subjects factorial analysis of 
variance. 


Example 24.7 In order to assess the efficacy of a drug which a pharmaceutical company claims 
is effective in treating hyperactivity, six hyperactive children are evaluated during the following 
three time periods: a) One week prior to taking the drug; b) After a child has taken the drug for 
six consecutive months; and c) Six months after the drug is discontinued. The children are 
observed by judges who employ a standardized procedure for evaluating hyperactivity. During 
each time period a child is assigned a score between 0 and 10, in which the higher the score, the 
higher the level of hyperactivity. During the evaluation process, the judges are blind with 
respect to whether or not a child is taking medication at the time he or she is evaluated. The 
hyperactivity scores of the six children during the three time periods follow: Child 1: (9, 7, 4); 
Child 2: (10, 8, 7); Child 3: (7, 5, 3); Child 4: (10, 8, 7); Child 5: (7, 5, 2); Child 6: (8, 6, 
6). Do the data indicate that the drug is effective? 


Since it lacks a control group, Example 24.7 is subject to the same criticism that is noted for 
Example 24.6. Because of the lack of a control group (i.e., a group of hyperactive children who 
do not receive medication), any observed differences in hyperactivity between two or more of 
the measurement periods can be the result of extraneous factors in the external environment or 
physiological/maturational changes in the children that are independent of whether a child is 
taking the drug. In spite of its limitations, it is not unusual to encounter the use of the design 
employed in Example 24.7 (which is commonly referred to as an ABA design) in behavior 
modification research. Such designs are most commonly employed with individual subjects in 
order to assess the efficacy of a treatment protocol. The letters A and B in an ABA design refer 
to whether or not a treatment is in effect during a specific time period. In Example 24.7, Time 
period 1 is designated A since no treatment is in effect. This initial measure of the subject's 
behavior provides the researcher with a baseline measure of hyperactivity. During Time period 
2, which is designated by the letter B, the treatment is in effect. If the treatment is effective, a 
decrease in hyperactivity in Time period 2 relative to Time period 1 is expected. Time period 
3 is once again designated A, since the treatment is no longer employed. If, in fact, the treatment 
is effective it is expected that a subject's level of hyperactivity during Time period 3 will be 
higher than in Time period 2, and, in fact, return to the baseline level obtained during Time 
period 1 (unless, of course, the drug has a permanent residual effect). When an ABA design is 
employed with an individual subject, the format of the data resulting from such a study is not 
suitable for evaluation with an analysis of variance. 


Endnotes 


1. A within-subjects/repeated-measures design in which each subject serves under each of 
the k levels of the independent variable is often described as a special case of a random- 
ized-blocks design. The term randomized-blocks design is most commonly employed 
to describe a dependent samples design involving matched subjects. As an example, 
assume that 10 sets of identical triplets are employed in a study to determine the efficacy 
of two drugs when compared with a placebo. Within each set of triplets one of the 
members is randomly assigned to each of the three experimental conditions. Such a design 
is described in various sources as a matched-subjects/samples design, a dependent 
samples design, a correlated-subjects design, or a randomized-blocks design. Within 
the usage of the term randomized-blocks design, each set of triplets constitutes a block, 
and consequently 10 blocks are employed in the study with three subjects in each block. 


2. It should be noted that if an experiment is confounded, one cannot conclude that a sig- 
nificant portion of between-conditions variability is attributed to the independent variable. 
This is the case, since if one or more confounding variables systematically vary with the 
levels of the independent variable, a significant difference can be due to a confounding 
variable rather than the independent variable. 


3. A discussion of counterbalancing can be found in Section VII of the t test for two 
dependent samples (Test 17). 


4. In other words, each subject is tested under one of the six possible presentation orders for 
the three experimental conditions and, within the sample of six subjects, each of the 
presentation orders is presented once. Specifically, the following six presentation orders 
are employed: 1,2,3; 1,3,2; 2,1,3; 2,3,1; 3,1,2; 3,2,1. 
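
The complete counterbalancing described in this endnote can be illustrated with a short Python sketch. The names conditions, orders, and assignment are hypothetical, and pairing Subject 1 with the first order, Subject 2 with the second, and so on is an arbitrary choice made only for illustration; in an actual study the orders would be assigned to subjects at random.

```python
from itertools import permutations

conditions = (1, 2, 3)

# All k! = 3! = 6 possible presentation orders of the three conditions.
orders = list(permutations(conditions))

# Complete counterbalancing with six subjects: each order is used exactly once.
# (Illustrative pairing only; real assignment of orders to subjects is random.)
assignment = {subject: order for subject, order in enumerate(orders, start=1)}

for subject, order in assignment.items():
    print(f"Subject {subject}: {order}")
```

Since itertools.permutations returns the orders in lexicographic sequence, the six orders are listed in the same sequence as in the endnote.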


5. Although it is possible to conduct a directional analysis, such an analysis will not be 
described with respect to the analysis of variance. A discussion of a directional analysis 
when k = 2 can be found under the t test for two dependent samples. In addition, a 
discussion of one-tailed F values can be found in Section VI of the t test for two independent 
samples under the discussion of Hartley's $F_{max}$ test for homogeneity of variance/F 
test for two population variances (Test 11a). A discussion of the evaluation of a 
directional alternative hypothesis when k ≥ 3 can be found in Section VII of the chi-square 
goodness-of-fit test (Test 8). Although the latter discussion is in reference to analysis of 
a k independent samples design involving categorical data, the general principles regarding 
analysis of a directional alternative hypothesis when k ≥ 3 are applicable to the analysis of 
variance. 


6. In Section VII it is noted that the sum of between-conditions variability and residual 
variability represents what is referred to as within-subjects variability. The sum of 
squares of within-subjects variability ($SS_{WS}$) is the sum of between-conditions 
variability and residual variability, i.e., $SS_{WS} = SS_{BC} + SS_{res}$. 


7. Since there is an equal number of scores in each condition, the equation for $SS_{BC}$ can also 
be written as follows: 

$$SS_{BC} = \frac{(\sum X_1)^2 + (\sum X_2)^2 + \cdots + (\sum X_k)^2}{n} - \frac{(\sum X_T)^2}{N}$$

Thus: 

$$SS_{BC} = \frac{(51)^2 + (39)^2 + (29)^2}{6} - \frac{(119)^2}{18} = 40.45$$


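8. The equation for $SS_{BS}$ can also be written as follows: 

$$SS_{BS} = \frac{(\sum S_1)^2 + (\sum S_2)^2 + \cdots + (\sum S_n)^2}{k} - \frac{(\sum X_T)^2}{N}$$

$$SS_{BS} = \frac{(20)^2 + (25)^2 + (15)^2 + (25)^2 + (14)^2 + (20)^2}{3} - \frac{(119)^2}{18} = 36.93$$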
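
The two computational equations in the preceding endnotes can be checked with a minimal Python sketch. It assumes the scores of Example 24.1 (which are reproduced in Example 25.1), and the names scores, condition_sums, subject_sums, ss_bc, and ss_bs are hypothetical names introduced here for illustration.

```python
# Rows are subjects; columns are Conditions 1, 2, 3 (scores of Example 24.1/25.1).
scores = [
    (9, 7, 4),
    (10, 8, 7),
    (7, 5, 3),
    (10, 8, 7),
    (7, 5, 2),
    (8, 6, 6),
]
n = len(scores)       # number of subjects (scores per condition)
k = len(scores[0])    # number of conditions
N = n * k             # total number of scores

grand_sum = sum(sum(row) for row in scores)
correction = grand_sum ** 2 / N                      # (sum of all scores)^2 / N

condition_sums = [sum(row[j] for row in scores) for j in range(k)]
subject_sums = [sum(row) for row in scores]

# Between-conditions and between-subjects sums of squares.
ss_bc = sum(s ** 2 for s in condition_sums) / n - correction
ss_bs = sum(s ** 2 for s in subject_sums) / k - correction

print(condition_sums, subject_sums)
print(round(ss_bc, 2), round(ss_bs, 2))
```

Carried at full precision, the sketch yields approximately 40.44 and 36.94, which agree with the values reported above to within the rounding of intermediate steps.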


9. In the interest of accuracy, as is the case with the single-factor between-subjects analysis 
of variance, a significant omnibus F value indicates that there is at least one significant 
difference among all possible comparisons that can be conducted. Thus, it is theoretically 
possible that none of the simple/pairwise comparisons are significant, and that the 
significant difference (or differences) involves one or more complex comparisons. 


10. As noted in Section VI of the single-factor between-subjects analysis of variance, in 
some instances the $CD_{B/D}$ value associated with the Bonferroni-Dunn test will be larger 
than the $CD_S$ value associated with the Scheffé test. However, when there are c = 3 
comparisons, $CD_S$ will be greater than $CD_{B/D}$. 


11. One can, of course, conduct a replication study and base the estimate of $MS_{res}$ on the value 
of $MS_{res}$ obtained for the comparison in the latter study. In point of fact, one or more 
replication studies can serve as a basis for obtaining the best possible estimate of error 
variability to employ for any comparison conducted following an analysis of variance. 


12. If the means of each of the conditions for which a composite mean is computed are 
weighted equally, an even simpler method for computing the composite score of a subject 
is to add the subject's scores and divide the sum by the number of conditions that are 
involved. Thus, the composite score of Subject 1 can be obtained by adding 9 and 7 and 
dividing by 2. The averaging procedure will only work if all of the means are weighted 
equally. The protocol described in Section VI must be employed in instances where a 
comparison involves unequal weighting of means. 


13. The same result is obtained if (for the three difference scores) the score in the first condition 
noted is subtracted from the score in the second condition noted (i.e., Condition 2 - Condition 1; 
Condition 3 - Condition 1; Condition 3 - Condition 2). 


14. If the variance of Condition 2 is employed to represent the lowest variance, $r_{X_2 X_3}$ also 
equals .85. 


15. It should be noted that if the accuracy of the probabilities associated with the outcome of 
one or more studies is subject to challenge, the accuracy of a pooled probability will be 
compromised. One can argue, however, that if enough replication studies are conducted, 
probability inaccuracies in one direction will most likely be balanced by probability 
inaccuracies in the opposite direction. 


16. Inspection of the $df_{res} = 10$ curve reveals that for $df_{res} = 10$, a value of approximately 
$\phi = 3.1$ or greater will be associated with a power of 1. 


17. A number of different alternative equations have been proposed for computing standard 
omega squared. Although a slightly different equation was employed in the first edition 
of this book, it yields approximately the same result that is obtained with Equation 24.25. 


In using Equation 2.7, $n is equivalent to sy . 
1 


n 
i- 


19. In employing double (or even more than two) summation signs such as 
$\sum_{j=1}^{k} \sum_{i=1}^{n} X_{ij}$, the 
mathematical operations specified are carried out beginning with the summation sign that 
is farthest to the right and continued sequentially with those operations specified by 
summation signs to the left. Specifically, if k = 3 and n = 6, the notation 
$\sum_{j=1}^{k} \sum_{i=1}^{n} X_{ij}$ indicates 
that the sum of the n scores in Condition 1 is computed, after which the sum of the 
n scores in Condition 2 is computed, after which the sum of the n scores in Condition 3 is 
computed. The final result will be the sum of all the aforementioned values that have been 
computed. On the other hand, the notation $\sum_{i=1}^{n} \sum_{j=1}^{k} X_{ij}$ indicates that the sum of the k = 3 
scores of Subject 1 is computed, after which the sum of the k = 3 scores of Subject 2 is 
computed, and so on until the sum of the k = 3 scores of Subject 6 is computed. The final 
result will be the sum of all the aforementioned values that have been computed. In this 
example the final value computed for $\sum_{j=1}^{k} \sum_{i=1}^{n} X_{ij}$ will be equal to the final value computed 
for $\sum_{i=1}^{n} \sum_{j=1}^{k} X_{ij}$. In obtaining the final value, however, the order in which the operations 
are conducted is reversed. Specifically, in computing $\sum_{j=1}^{k} \sum_{i=1}^{n} X_{ij}$ the sums of the k 
columns are computed and summed in order to arrive at the grand sum, while in computing 
$\sum_{i=1}^{n} \sum_{j=1}^{k} X_{ij}$ the sums of the n rows are computed and summed in order to arrive at the 
grand sum. 
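
The two orders of summation can be made concrete with a brief Python sketch. The data are assumed to be the k = 3 by n = 6 scores of Example 24.1 (reproduced in Example 25.1); the names column_first and row_first are hypothetical.

```python
# Rows are subjects (i = 1..n); columns are conditions (j = 1..k).
scores = [
    (9, 7, 4),
    (10, 8, 7),
    (7, 5, 3),
    (10, 8, 7),
    (7, 5, 2),
    (8, 6, 6),
]
n, k = len(scores), len(scores[0])

# Inner (rightmost) sum runs over subjects i, outer sum over conditions j:
# the k column sums are computed first, then added together.
column_first = sum(sum(scores[i][j] for i in range(n)) for j in range(k))

# Inner (rightmost) sum runs over conditions j, outer sum over subjects i:
# the n row sums are computed first, then added together.
row_first = sum(sum(scores[i][j] for j in range(k)) for i in range(n))

print(column_first, row_first)
```

Both expressions arrive at the same grand sum (119 for these data); only the order in which the partial sums are formed differs.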


20. For each of the N = 18 scores in Table 24.14, the following is true with respect to the 
contribution of any score to the total variability in the data: 

$$(X_{ij} - \bar{X}_T) = (\bar{S}_i - \bar{X}_T) + (X_{ij} - \bar{S}_i)$$
Total deviation score = BS deviation score + WS deviation score 

and 

$$(X_{ij} - \bar{S}_i) = (\bar{X}_j - \bar{X}_T) + [(X_{ij} - \bar{X}_T) - (\bar{S}_i - \bar{X}_T) - (\bar{X}_j - \bar{X}_T)]$$
WS deviation score = BC deviation score + Residual deviation score 
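
As a numerical check on the two identities above, the following Python sketch decomposes every score into its deviation components. It assumes that the N = 18 scores in question are those of Example 24.1 (reproduced in Example 25.1); all variable names are hypothetical.

```python
scores = [
    (9, 7, 4),
    (10, 8, 7),
    (7, 5, 3),
    (10, 8, 7),
    (7, 5, 2),
    (8, 6, 6),
]
n, k = len(scores), len(scores[0])

grand_mean = sum(sum(row) for row in scores) / (n * k)
subject_means = [sum(row) / k for row in scores]
condition_means = [sum(row[j] for row in scores) / n for j in range(k)]

max_error = 0.0
ss_total = ss_bs = ss_bc = ss_res = 0.0
for i in range(n):
    for j in range(k):
        total = scores[i][j] - grand_mean           # total deviation score
        bs = subject_means[i] - grand_mean          # between-subjects deviation
        ws = scores[i][j] - subject_means[i]        # within-subjects deviation
        bc = condition_means[j] - grand_mean        # between-conditions deviation
        residual = total - bs - bc                  # residual deviation
        # Total = BS + WS, and WS = BC + Residual, for every score.
        max_error = max(max_error, abs(total - (bs + ws)), abs(ws - (bc + residual)))
        ss_total += total ** 2
        ss_bs += bs ** 2
        ss_bc += bc ** 2
        ss_res += residual ** 2

print(max_error < 1e-9)
print(round(ss_total, 2), round(ss_bs + ss_bc + ss_res, 2))
```

In this balanced design the squared deviation components also sum across all scores, so the total sum of squares equals the sum of the between-subjects, between-conditions, and residual sums of squares.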


21. As noted in Section VI under the discussion of computation of a confidence interval, $MS_{WC}$ 
is equivalent to $MS_{WG}$ (which is the analogous measure of variability for the single-factor 
between-subjects analysis of variance). 


22. An issue discussed by Keppel (1991) that is relevant to the power of the single-factor 
within-subjects analysis of variance is that even though counterbalancing is an effective 
procedure for distributing practice effects evenly over the k experimental conditions in a 
dependent samples/within-subjects design, if practice effects are, in fact, present in the data, 
the value of $MS_{res}$ will be inflated, and because of the latter the power of the single-factor 
within-subjects analysis of variance will be reduced. Keppel (1991) describes a methodology 
for computing an adjusted measure of $MS_{res}$ that is independent of practice effects, 
which allows for a more powerful test of an alternative hypothesis. 


23. In evaluating a directional alternative hypothesis, when k = 2 the tabled $F_{.90}$ and $F_{.98}$ 
values (for the appropriate degrees of freedom) are, respectively, employed as the one-tailed 
.05 and .01 values. Since the values for $F_{.90}$ and $F_{.98}$ are not listed in Table A10, 
the values $F_{.90} = 3.36$ and $F_{.98} = 7.95$ can be obtained by squaring the tabled critical 
one-tailed values $t_{.05} = 1.83$ and $t_{.01} = 2.82$, by employing more extensive tables of the F 
distribution available in other sources, or through interpolation. 


24. Example 24.6 (as well as Example 24.7) can also be viewed as an example of what is 
commonly referred to as a time series design (although time series designs typically 
involve more measurement periods than are employed in the latter example). The latter 
design is essentially a before-after design involving one or more measurement periods 
prior to an experimental treatment, and one or more measurement periods following the 
experimental treatment. 


Test 25 


The Friedman Two-Way Analysis of Variance by Ranks 
(Nonparametric Test Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test In a set of k dependent samples (where k ≥ 2), do at least two 
of the samples represent populations with different median values? 


Relevant background information on test Prior to reading the material on the Friedman 
two-way analysis of variance by ranks, the reader may find it useful to review the general 
information regarding a dependent samples design contained in Sections I and VII of the t test 
for two dependent samples (Test 17). The Friedman two-way analysis of variance by ranks 
(Friedman (1937)) is employed with ordinal (rank-order) data in a hypothesis testing situation 
involving a design with two or more dependent samples. The test is an extension of the binomial 
sign test for two dependent samples (Test 19) to a design involving more than two dependent 
samples and, when k = 2, the Friedman two-way analysis of variance by ranks will yield a 
result that is equivalent to that obtained with the binomial sign test for two dependent 
samples.¹ If the result of the Friedman two-way analysis of variance by ranks is significant, 
it indicates there is a significant difference between at least two of the sample medians in the set 
of k medians. As a result of the latter, the researcher can conclude there is a high likelihood that 
at least two of the samples represent populations with different median values. 

In employing the Friedman two-way analysis of variance by ranks, one of the following 
is true with regard to the rank-order data that are evaluated: a) The data are in a rank-order 
format, since it is the only format in which scores are available; or b) The data have been 
transformed into a rank-order format from an interval/ratio format, since the researcher has 
reason to believe that one or more of the assumptions of the single-factor within-subjects 
analysis of variance (Test 24) (which is the parametric analog of the Friedman test) are 
saliently violated. It should be pointed out that when a researcher elects to transform a set of 
interval/ratio data into ranks, information is sacrificed. This latter fact accounts for why there 
is reluctance among some researchers to employ nonparametric tests such as the Friedman two- 
way analysis of variance by ranks, even if there is reason to believe that one or more of the 
assumptions of the single-factor within-subjects analysis of variance have been violated. 

Various sources (e.g., Conover (1980, 1999), Daniel (1990)) note that the Friedman two- 
way analysis of variance by ranks is based on the following assumptions: a) The sample of n 
subjects has been randomly selected from the population it represents; and b) The dependent 
variable (which is subsequently ranked) is a continuous random variable. In truth, the latter 
assumption, which is common to many nonparametric tests, is often not adhered to, in that such tests are 
often employed with a dependent variable that represents a discrete random variable. 

As is the case for other tests that are employed to evaluate data involving two or more 
dependent samples, in order for the Friedman two-way analysis of variance by ranks to generate 
valid results the following guidelines should be adhered to:² a) To control for order effects, 


© 2000 by Chapman & Hall/CRC 


the presentation of the k experimental conditions should be random or, if appropriate, be 
counterbalanced; and b) If matched samples are employed, within each set of matched subjects each of 
the subjects should be randomly assigned to one of the k experimental conditions. 

As is noted with respect to other tests that are employed to evaluate a design involving two 
or more dependent samples, the Friedman two-way analysis of variance by ranks can also be 
used to evaluate a before-after design, as well as extensions of the latter design that involve 
more than two measurement periods. The limitations of the before-after design (which are 
discussed in Section VII of the t test for two dependent samples) are also applicable when it 
is evaluated with the Friedman two-way analysis of variance by ranks. 


II. Example 


Example 25.1 is identical to Example 24.1 (which is evaluated with the single-factor within- 
subjects analysis of variance). In evaluating Example 25.1 it will be assumed that the ratio data 
(i.e., the number of nonsense syllables correctly recalled) are rank-ordered, since one or more of 
the assumptions of the single-factor within-subjects analysis of variance have been saliently 
violated. 


Example 25.1 A psychologist conducts a study to determine whether or not noise can inhibit 
learning. Each of six subjects is tested under three experimental conditions. In each of the 
experimental conditions a subject is given 20 minutes to memorize a list of 10 nonsense syllables, 
which the subject is told she will be tested on the following day. The three experimental con- 
ditions each subject serves under are as follows: Condition 1, the no noise condition, requires 
subjects to study the list of nonsense syllables in a quiet room. Condition 2, the moderate noise 
condition, requires subjects to study the list of nonsense syllables while listening to classical 
music. Condition 3, the extreme noise condition, requires subjects to study the list of nonsense 
syllables while listening to rock music. Although in each of the experimental conditions subjects 
are presented with a different list of nonsense syllables, the three lists are comparable with 
respect to those variables that are known to influence a person's ability to learn nonsense 
syllables. To control for order effects, the order of presentation of the three experimental 
conditions is completely counterbalanced. The number of nonsense syllables correctly recalled 
by the six subjects under the three experimental conditions follows. (Subjects' scores are listed 
in the order Condition 1, Condition 2, Condition 3.) Subject 1: 9, 7, 4; Subject 2: 10, 8, 7; 
Subject 3: 7, 5, 3; Subject 4: 10, 8, 7; Subject 5: 7, 5, 2; Subject 6: 8, 6, 6. Do the data 
indicate that noise influenced subjects' performance? 


III. Null versus Alternative Hypotheses 


Null hypothesis $H_0: \theta_1 = \theta_2 = \theta_3$ 


(The median of the population Condition 1 represents equals the median of the population 
Condition 2 represents equals the median of the population Condition 3 represents. With respect 
to the sample data, when the null hypothesis is true the sums of the ranks (as well as the mean 
ranks) of all k conditions will be equal.) 


Alternative hypothesis $H_1$: Not $H_0$ 


(This indicates that there is a difference between at least two of the k = 3 population medians. 
It is important to note that the alternative hypothesis should not be written as follows: 
$H_1: \theta_1 \neq \theta_2 \neq \theta_3$. The latter notation for the alternative hypothesis is incorrect 
because it implies that all three population medians must differ from one another in order to 
reject the null hypothesis. With respect to the sample data, if the alternative hypothesis is true, 
the sum of the ranks (as well as the mean ranks) of at least two of the k conditions will not be 
equal. In this book it will be assumed (unless stated otherwise) that the alternative hypothesis 
for the Friedman two-way analysis of variance by ranks is stated nondirectionally.)³ 


IV. Test Computations 


The data for Example 25.1 are summarized in Table 25.1. The number of subjects employed in 
the experiment is n = 6, and thus within each condition there are $n = n_1 = n_2 = n_3 = 6$ scores. 
The original interval/ratio scores of the six subjects are recorded in the columns labelled $X_1$, $X_2$, 
and $X_3$. The adjacent columns $R_1$, $R_2$, and $R_3$ note the rank-order assigned to each of the scores. 


Table 25.1 Data for Example 25.1 


              Condition 1        Condition 2        Condition 3 
              $X_1$    $R_1$     $X_2$    $R_2$     $X_3$    $R_3$ 
Subject 1       9        3         7        2         4        1 
Subject 2      10        3         8        2         7        1 
Subject 3       7        3         5        2         3        1 
Subject 4      10        3         8        2         7        1 
Subject 5       7        3         5        2         2        1 
Subject 6       8        3         6       1.5        6       1.5 

$\sum R_1 = 18$      $\sum R_2 = 11.5$      $\sum R_3 = 6.5$ 

$\bar{R}_1 = \sum R_1 / n_1 = 18/6 = 3$      $\bar{R}_2 = \sum R_2 / n_2 = 11.5/6 = 1.92$      $\bar{R}_3 = \sum R_3 / n_3 = 6.5/6 = 1.08$ 


The ranking procedure employed for the Friedman two-way analysis of variance by 
ranks requires that each of the k scores of a subject be ranked within that subject.⁴ Thus, in 
Table 25.1, for each subject a rank of 1 is assigned to the subject's lowest score, a rank of 2 to 
the subject's middle score, and a rank of 3 to the subject's highest score. In the event of tied 
scores, the same protocol described for handling ties for other rank-order tests (discussed in detail 
in Section IV of the Mann-Whitney U test (Test 12)) is employed. Specifically, the average 
of the ranks involved is assigned to all scores tied for a given rank. The only example of tied 
scores in Example 25.1 is in the case of Subject 6 who has a score of 6 in both Conditions 2 and 
3. In Table 25.1 both of these scores are assigned a rank of 1.5, since if the scores of Subject 6 
in Conditions 2 and 3 were not identical, but were still less than the subject's third score (which 
is 8 in Condition 1), one of the two scores that are, in fact, tied would receive a rank of 1 and the 
other a rank of 2. The average of these two ranks (i.e., (1 + 2)/2 = 1.5) is thus assigned to each 
of the two tied scores. 

It should be noted that it is permissible to reverse the ranking protocol described above. 
Specifically, one can assign a rank of 1 to a subject's highest score, a rank of 2 to the subject's 
middle score, and a rank of 3 to the subject’s lowest score. This reverse ranking protocol will 
yield the same value for the Friedman test statistic as the protocol employed in Table 25.1. 

Upon rank-ordering the scores of the n = 6 subjects, the sum of the ranks is computed for 
each of the experimental conditions. In Table 25.1 the sum of the ranks of the j-th condition is 
represented by the notation $\sum R_j$. Thus, $\sum R_1 = 18$, $\sum R_2 = 11.5$, $\sum R_3 = 6.5$. Although they 
are not required for the Friedman test computations, the mean rank ($\bar{R}_j$) for each of the 
conditions is also noted in Table 25.1. 
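
The within-subject ranking protocol, including the averaging of tied ranks, can be sketched in Python as follows. The function name rank_within_subject and the variable names are hypothetical; the scores are those listed in Example 25.1.

```python
def rank_within_subject(row):
    """Rank one subject's k scores (1 = lowest), averaging the ranks of tied scores."""
    ordered = sorted(row)
    ranks = []
    for x in row:
        first = ordered.index(x) + 1      # lowest rank position occupied by this value
        tied = ordered.count(x)           # number of scores tied at this value
        ranks.append(first + (tied - 1) / 2)   # mean of positions first..first+tied-1
    return ranks

# Rows are subjects; columns are Conditions 1, 2, 3.
scores = [
    (9, 7, 4),
    (10, 8, 7),
    (7, 5, 3),
    (10, 8, 7),
    (7, 5, 2),
    (8, 6, 6),
]
n, k = len(scores), len(scores[0])

ranks = [rank_within_subject(row) for row in scores]
rank_sums = [sum(r[j] for r in ranks) for j in range(k)]
mean_ranks = [s / n for s in rank_sums]

print(ranks[5])      # ranks of Subject 6, whose scores in Conditions 2 and 3 are tied
print(rank_sums)
```

For Subject 6 the sketch assigns the ranks 3, 1.5, and 1.5, and the three rank sums are 18, 11.5, and 6.5, as in Table 25.1.
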

The chi-square distribution is used to approximate the sampling distribution of the Friedman test 
statistic. Equation 25.1 is employed to compute the chi-square approximation of the Friedman test 
statistic (which is represented in most sources by the notation $\chi_r^2$). 


$$\chi_r^2 = \frac{12}{nk(k + 1)} \left[ \sum_{j=1}^{k} \left( \sum R_j \right)^2 \right] - 3n(k + 1)$$   (Equation 25.1) 

Note that in Equation 25.1 the term $\sum_{j=1}^{k} (\sum R_j)^2$ indicates that the sum of the ranks for each 
of the k experimental conditions is squared, and that the squared sums of ranks are summed. 
Substituting the appropriate values from Example 25.1 in Equation 25.1, the value $\chi_r^2 = 11.08$ 
is computed. 


$$\chi_r^2 = \frac{12}{(6)(3)(3 + 1)} \left[ (18)^2 + (11.5)^2 + (6.5)^2 \right] - (3)(6)(3 + 1) = 11.08$$
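
Equation 25.1 translates directly into a few lines of Python. The sketch below is illustrative only: the function name friedman_chi_square is hypothetical, and the rank sums are taken from Table 25.1. It also checks the earlier remark that reversing the ranking protocol leaves the test statistic unchanged.

```python
def friedman_chi_square(rank_sums, n):
    """Chi-square approximation of the Friedman statistic (Equation 25.1)."""
    k = len(rank_sums)
    return 12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)

n, k = 6, 3
rank_sums = [18, 11.5, 6.5]
chi_r2 = friedman_chi_square(rank_sums, n)

# Under the reverse protocol each rank r becomes (k + 1) - r, so each
# rank sum becomes n(k + 1) minus the original rank sum.
reversed_rank_sums = [n * (k + 1) - r for r in rank_sums]
chi_r2_reversed = friedman_chi_square(reversed_rank_sums, n)

print(round(chi_r2, 2), round(chi_r2_reversed, 2))
```

Both calls return approximately 11.08, the value computed above.
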


V. Interpretation of the Test Results 


In order to reject the null hypothesis, the computed value $\chi_r^2$ must be equal to or greater than 
the tabled critical chi-square value at the prespecified level of significance. The computed 
chi-square value is evaluated with Table A4 (Table of the Chi-Square Distribution) in the 
Appendix. For the appropriate degrees of freedom, the tabled $\chi^2_{.95}$ value (which is the chi-square 
value at the 95th percentile) and the tabled $\chi^2_{.99}$ value (which is the chi-square value at the 99th 
percentile) are employed as the .05 and .01 critical values for evaluating a nondirectional 
alternative hypothesis. The number of degrees of freedom employed in the analysis is 
computed with Equation 25.2. Thus, df = 3 - 1 = 2. 


$$df = k - 1$$   (Equation 25.2) 


For df = 2, the tabled critical .05 and .01 chi-square values are $\chi^2_{.05} = 5.99$ and 
$\chi^2_{.01} = 9.21$. Since the computed value $\chi_r^2 = 11.08$ is greater than $\chi^2_{.05} = 5.99$ and $\chi^2_{.01} = 9.21$, 
the alternative hypothesis is supported at both the .05 and .01 levels. A summary of the analysis 
of Example 25.1 with the Friedman two-way analysis of variance by ranks follows: It can be 
concluded that there is a significant difference between at least two of the three experimental 
conditions exposed to different levels of noise. This result can be summarized as follows: 
$\chi_r^2(2) = 11.08$, p < .01. 
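
For readers without access to a chi-square table, the critical values and the tail probability for this example can be sketched in Python. The sketch relies on the fact that, for df = 2 only, the upper-tail probability of the chi-square distribution has the closed form exp(-x/2); it does not apply to other degrees of freedom, and the function names are hypothetical.

```python
import math

def chi2_upper_tail_df2(x):
    """Upper-tail probability P(X >= x) for a chi-square variable with df = 2 only."""
    return math.exp(-x / 2)

def chi2_critical_df2(alpha):
    """Value whose upper-tail probability equals alpha when df = 2 only."""
    return -2 * math.log(alpha)

chi_r2 = 11.08                           # computed Friedman statistic
crit_05 = chi2_critical_df2(0.05)
crit_01 = chi2_critical_df2(0.01)
p_value = chi2_upper_tail_df2(chi_r2)

print(round(crit_05, 2), round(crit_01, 2))
print(chi_r2 >= crit_05, chi_r2 >= crit_01, p_value < 0.01)
```

The critical values round to 5.99 and 9.21, matching the tabled values, and the approximate probability associated with the computed statistic falls below .01.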

It should be noted that when the data for Example 25.1 are evaluated with a single-factor 
within-subjects analysis of variance, the null hypothesis can also be rejected at both the .05 and 
.01 levels. The reader should note, however, that the difference between the value $\chi_r^2 = 11.08$ 
(obtained for the Friedman test) and $\chi^2_{.01} = 9.21$ (the .01 tabled critical value for the Friedman 
test) is much smaller than the difference between F = 41.29 (obtained for the analysis of 
variance) and $F_{.01} = 7.56$ (the .01 tabled critical value for the analysis of variance). The smaller 
difference between the computed test statistic and the tabled critical value in the case of the 
Friedman test reflects the fact that, as a general rule (assuming that none of the assumptions of 
the analysis of variance are saliently violated), it provides a less powerful test of an alternative 
hypothesis than the analysis of variance. 


VI. Additional Analytical Procedures for the Friedman Two-Way 
Analysis of Variance by Ranks and/or Related Tests 


1. Tie correction for the Friedman two-way analysis of variance by ranks Some sources 
recommend that if there is an excessive number of ties in the overall distribution of scores, 
the value of the Friedman test statistic be adjusted. The tie correction results in a small 
increase in the value of $\chi_r^2$ (thus providing a slightly more powerful test of the alternative 
hypothesis). Equation 25.3 (based on a methodology described in Daniel (1990) and Marascuilo 
and McSweeney (1977)) is employed in computing the value C, which represents the tie 
correction factor for the Friedman two-way analysis of variance by ranks. 


$$C = 1 - \frac{\sum_{i=1}^{s} (t_i^3 - t_i)}{n(k^3 - k)}$$   (Equation 25.3) 


Where: s = the number of sets of ties 
       $t_i$ = the number of tied scores in the i-th set of ties 

The notation $\sum_{i=1}^{s} (t_i^3 - t_i)$ indicates the following: a) For each set of ties, the number of 
ties in the set is subtracted from the cube of the number of ties in that set; and b) The sum of all 
the values computed in part a) is obtained. The tie correction will now be computed for Example 
25.1. In the latter example there is s = 1 set of ties in which there are $t_i = 2$ ties (i.e., the 
two scores of 6 for Subject 6 under Conditions 2 and 3). Thus: 


$$\sum_{i=1}^{s} (t_i^3 - t_i) = [(2)^3 - 2] = 6$$


Employing Equation 25.3, the value C = .958 is computed. 


$$C = 1 - \frac{6}{6[(3)^3 - 3]} = .958$$


$\chi_{r_c}^2$, which represents the tie-corrected value of the Friedman test statistic, is computed 
with Equation 25.4. 


$$\chi_{r_c}^2 = \frac{\chi_r^2}{C}$$   (Equation 25.4) 


Employing Equation 25.4, the tie-corrected value $\chi_{r_c}^2 = 11.57$ is computed. 


$$\chi_{r_c}^2 = \frac{11.08}{.958} = 11.57$$


As is the case with $\chi_r^2 = 11.08$ computed with Equation 25.1, the value $\chi_{r_c}^2 = 11.57$ 
computed with Equation 25.4 is significant at both the .05 and .01 levels (since it is greater than 
$\chi^2_{.05} = 5.99$ and $\chi^2_{.01} = 9.21$). Although Equation 25.4 results in a slightly less conservative 
test than Equation 25.1, in this instance the two equations lead to identical conclusions with 
respect to the null hypothesis. 
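
The tie correction of Equations 25.3 and 25.4 can be sketched in Python as shown below. The sketch assumes that a set of ties is a group of identical scores within a single subject (which is how the one set of ties in Example 25.1 arises); the function name tie_correction and the variable names are hypothetical.

```python
from collections import Counter

def tie_correction(scores):
    """Tie correction factor C (Equation 25.3), with ties counted within each subject."""
    n, k = len(scores), len(scores[0])
    tie_term = 0
    for row in scores:
        for t in Counter(row).values():   # t = number of scores sharing one value
            if t > 1:
                tie_term += t ** 3 - t
    return 1 - tie_term / (n * (k ** 3 - k))

scores = [
    (9, 7, 4),
    (10, 8, 7),
    (7, 5, 3),
    (10, 8, 7),
    (7, 5, 2),
    (8, 6, 6),
]

# Uncorrected statistic from Equation 25.1, using the rank sums of Table 25.1.
chi_r2 = 12 / (6 * 3 * (3 + 1)) * (18 ** 2 + 11.5 ** 2 + 6.5 ** 2) - 3 * 6 * (3 + 1)

C = tie_correction(scores)
chi_r2_corrected = chi_r2 / C             # Equation 25.4

print(round(C, 3), round(chi_r2_corrected, 2))
```

With the example data the sketch gives C = .958 and a tie-corrected statistic of approximately 11.57, in agreement with the hand computation.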

2. Pairwise comparisons following computation of the test statistic for the Friedman 
two-way analysis of variance by ranks Prior to reading this section the reader should review 
the discussion of comparisons in Section VI of the single-factor between-subjects analysis of 
variance (Test 21). As is the case with the omnibus F value computed for an analysis of 
variance, the $\chi_r^2$ value computed with Equation 25.1 is based on an evaluation of all k 
experimental conditions. When the value of $\chi_r^2$ is significant, it does not indicate whether just two 
or, in fact, more than two conditions differ significantly from one another. In order to answer 
the latter question, it is necessary to conduct comparisons contrasting specific conditions with 
one another. This section will describe methodologies that can be employed for conducting 
simple/pairwise comparisons following the computation of a $\chi_r^2$ value.⁶ 

In conducting a simple comparison, the null hypothesis and nondirectional alternative 
hypothesis are as follows: $H_0: \theta_a = \theta_b$ versus $H_1: \theta_a \neq \theta_b$. In the aforementioned hypotheses, 
$\theta_a$ and $\theta_b$ represent the medians of the populations represented by the two conditions 
involved in the comparison. The alternative hypothesis can also be stated directionally as follows: 
$H_1: \theta_a > \theta_b$ or $H_1: \theta_a < \theta_b$. 

Various sources (e.g., Daniel (1990) and Siegel and Castellan (1988)) describe a 
comparison procedure for the Friedman two-way analysis of variance by ranks (which is 
essentially the application of the Bonferroni-Dunn method described in Section VI of the 
single-factor between-subjects analysis of variance to the Friedman test model). Through 
use of Equation 25.5, the procedure allows a researcher to identify the minimum required 
difference between the sums of the ranks of any two conditions (designated as $CD_F$) in order for 
them to differ from one another at the prespecified level of significance.⁷ 


$$CD_F = z_{adj} \sqrt{\frac{nk(k + 1)}{6}}$$   (Equation 25.5) 


The value of $z_{adj}$ is obtained from Table A1 (Table of the Normal Distribution) in the 
Appendix. In the case of a nondirectional alternative hypothesis, $z_{adj}$ is the z value above which 
a proportion of cases corresponding to the value $\alpha_{FW}/2c$ falls (where c is the total number of 
comparisons that are conducted). In the case of a directional alternative hypothesis, $z_{adj}$ is the 
z value above which a proportion of cases corresponding to the value $\alpha_{FW}/c$ falls. When all 
possible pairwise comparisons are made c = [k(k - 1)]/2, and thus, 2c = k(k - 1). In Example 
25.1 the number of pairwise/simple comparisons that can be conducted is c = [3(3 - 1)]/2 = 3: 
specifically, Condition 1 versus Condition 2, Condition 1 versus Condition 3, and Condition 
2 versus Condition 3. 

The value of $z_{adj}$ will be a function of both the maximum familywise Type I error rate 
($\alpha_{FW}$) the researcher is willing to tolerate and the total number of comparisons that are 
conducted. When a limited number of comparisons are planned prior to collecting the data, most 
sources take the position that a researcher is not obliged to control the value of $\alpha_{FW}$. In such a 
case, the per comparison Type I error rate ($\alpha_{PC}$) will be equal to the prespecified value of 
alpha. When $\alpha_{FW}$ is not adjusted, the value of $z_{adj}$ employed in Equation 25.5 will be the tabled 
critical z value that corresponds to the prespecified level of significance. Thus, if a 
nondirectional alternative hypothesis is employed and $\alpha = \alpha_{PC} = .05$, the tabled critical 
two-tailed .05 value $z_{.05} = 1.96$ is used to represent $z_{adj}$ in Equation 25.5. If $\alpha = \alpha_{PC} = .01$, the 
tabled critical two-tailed .01 value $z_{.01} = 2.58$ is used in Equation 25.5. In the same respect, if 
a directional alternative hypothesis is employed, the tabled critical .05 and .01 one-tailed values 
$z_{.05} = 1.65$ and $z_{.01} = 2.33$ are used for $z_{adj}$ in Equation 25.5. 

When comparisons are not planned beforehand, it is generally acknowledged that the value
of α_FW must be controlled so as not to become excessive. The general approach for controlling
the latter value is to establish a per comparison Type I error rate which insures that α_FW will
not exceed some maximum value stipulated by the researcher. One method for doing this
(described under the single-factor between-subjects analysis of variance as the
Bonferroni-Dunn method) establishes the per comparison Type I error rate by dividing the maximum
value one will tolerate for the familywise Type I error rate by the total number of comparisons
conducted. Thus, in Example 25.1, if one intends to conduct all three pairwise comparisons and
wants to insure that α_FW does not exceed .05, α_PC = α_FW/c = .05/3 = .0167. The latter
proportion is used to determine the value of z_adj. As noted earlier, if a directional alternative
hypothesis is employed for a comparison, the value of z_adj employed in Equation 25.5 is the z
value above which a proportion equal to α_PC = α_FW/c of the cases falls. In Table A1, the z
value that corresponds to the proportion .0167 is z = 2.13. By employing z_adj = 2.13 in
Equation 25.5, one can be assured that within the “family” of three pairwise comparisons, α_FW
will not exceed .05 (assuming all of the comparisons are directional). If a nondirectional
alternative hypothesis is employed for all of the comparisons, the value of z_adj will be the z value
above which a proportion equal to α_FW/2c = α_PC/2 of the cases falls. Since
α_PC/2 = .0167/2 = .0083, z = 2.39. By employing z_adj = 2.39 in Equation 25.5, one can be
assured that α_FW will not exceed .05.⁸
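
The Bonferroni-Dunn adjustment described above can be sketched in a few lines of Python. This is only an illustration: the function name `bonferroni_dunn_z` and its parameters are hypothetical, and the standard library's normal distribution is used in place of Table A1.

```python
from statistics import NormalDist

def bonferroni_dunn_z(alpha_fw, c, two_tailed=True):
    """Return (alpha_pc, z_adj) for c comparisons under a familywise rate alpha_fw.

    Hypothetical helper: alpha_pc = alpha_fw / c, and z_adj is the z value above
    which alpha_pc (one-tailed) or alpha_pc / 2 (two-tailed) of the cases falls.
    """
    alpha_pc = alpha_fw / c
    tail = alpha_pc / 2 if two_tailed else alpha_pc
    z_adj = NormalDist().inv_cdf(1 - tail)
    return alpha_pc, z_adj

# Example 25.1: three pairwise comparisons, familywise rate .05
alpha_pc, z_one = bonferroni_dunn_z(0.05, 3, two_tailed=False)
_, z_two = bonferroni_dunn_z(0.05, 3, two_tailed=True)
print(round(alpha_pc, 4), round(z_one, 2), round(z_two, 2))
```

Rounded to two decimal places, the two z values agree with the tabled values 2.13 and 2.39 cited in the text.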

In order to employ the CD_F value computed with Equation 25.5, it is necessary to
determine the absolute value of the difference between the sums of the ranks of each pair of
experimental conditions that are compared.⁹ Table 25.2 summarizes the difference scores
between pairs of sums of ranks.




Table 25.2 Difference Scores Between Pairs of Sums of Ranks for Example 25.1


|ΣR_1 - ΣR_2| = |18 - 11.5| = 6.5
|ΣR_1 - ΣR_3| = |18 - 6.5| = 11.5
|ΣR_2 - ΣR_3| = |11.5 - 6.5| = 5


If any of the differences between the sums of ranks is equal to or greater than the CD_F
value computed with Equation 25.5, a comparison is declared significant. Equation 25.5 will
now be employed to evaluate the nondirectional alternative hypothesis H_1: θ_a ≠ θ_b for all three
pairwise comparisons. Since it will be assumed that the comparisons are unplanned and that the
researcher does not want the value of α_FW to exceed .05, the value z_adj = 2.39 will be used in
computing CD_F.

$$CD_F = z_{adj}\sqrt{\frac{nk(k+1)}{6}} = (2.39)\sqrt{\frac{(6)(3)(4)}{6}} = (2.39)(3.46) = 8.28$$


The obtained value CD_F = 8.28 indicates that any difference between the sums of ranks
of two conditions that is equal to or greater than 8.28 is significant. With respect to the three
pairwise comparisons, the only difference between the sum of ranks of two conditions which is
greater than CD_F = 8.28 is |ΣR_1 - ΣR_3| = 11.5. Thus, we can conclude there is a significant
difference between Condition 1 and Condition 3. We cannot conclude that the difference
between any other pair of conditions is significant.
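
As an illustration, the critical difference and the three pairwise decisions can be reproduced with the short sketch below. The function name `critical_difference` is hypothetical, and the formula is Equation 25.5 as it is applied in this section (z_adj multiplied by the square root of nk(k + 1)/6).

```python
from itertools import combinations
from math import sqrt

def critical_difference(z_adj, n, k):
    """Minimum required difference between two sums of ranks (hypothetical helper)."""
    return z_adj * sqrt(n * k * (k + 1) / 6)

# Sums of ranks for the three conditions of Example 25.1
rank_sums = {1: 18.0, 2: 11.5, 3: 6.5}

cd = critical_difference(2.39, n=6, k=3)

# A comparison is declared significant when the absolute difference
# between the two sums of ranks is equal to or greater than cd.
significant = [
    (a, b)
    for a, b in combinations(rank_sums, 2)
    if abs(rank_sums[a] - rank_sums[b]) >= cd
]
print(round(cd, 2), significant)
```

Only the pair (1, 3) is retained, which matches the conclusion reached above.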

An alternative strategy that can be employed for conducting pairwise comparisons for
the Friedman test model is to use one of the tests that are described for evaluating a dependent
samples design involving k = 2 samples. Specifically, one can employ either the Wilcoxon
matched-pairs signed-ranks test (Test 18) or the binomial sign test for two dependent
samples. Whereas the binomial sign test only takes into consideration the direction of the
difference of subjects’ scores in the two experimental conditions, the Wilcoxon test rank-orders
the interval/ratio difference scores of subjects. Because of the latter, the Wilcoxon test employs
more information than the binomial sign test, and, consequently, will provide a more powerful
test of an alternative hypothesis. Both the Wilcoxon test and binomial sign test will be used to
conduct the three pairwise comparisons for Example 25.1.¹⁰

Use of the Wilcoxon matched-pairs signed-ranks test requires that for each comparison
that is conducted the difference scores of subjects in the two experimental conditions be
rank-ordered, and that the Wilcoxon T statistic be computed for that comparison. The exact
distribution of the Wilcoxon test statistic can only be used when the value of α_PC is equal to one of
the probabilities documented in Table A5 (Table of Critical T Values for Wilcoxon's
Signed-Ranks and Matched-Pairs Signed-Ranks Tests) in the Appendix. When α_PC is a value other
than those listed in Table A5, the normal approximation of the Wilcoxon test statistic must be
employed.

When the Wilcoxon matched-pairs signed-ranks test is employed for the three pairwise
comparisons, the following T values are computed: a) Condition 1 versus Condition 2: T = 0;
b) Condition 1 versus Condition 3: T = 0; and c) Condition 2 versus Condition 3: T = 0. When
the aforementioned T values are substituted in Equations 18.2 and 18.3 (the uncorrected and
continuity-corrected normal approximations for the Wilcoxon test), the following absolute z
values are computed: a) Condition 1 versus Condition 2: z = 2.20 and z = 2.10; b) Condition 1
versus Condition 3: z = 2.20 and z = 2.10; and c) Condition 2 versus Condition 3: z = 2.02 and
z = 1.89. If we want to evaluate a nondirectional alternative hypothesis and insure that α_FW
does not exceed .05, the value of α_PC is set equal to .0167. Table A5 cannot be employed, since
it does not list two-tailed critical T values for α = .0167. In order to evaluate the result of the
normal approximation, we identify the tabled critical two-tailed .0167 z value in Table A1. In
employing Equation 25.5 earlier in this section, we determined that the latter value is z_.0167 = 2.39.
Since none of the z values computed for the normal approximation is equal to or greater than
z_.0167 = 2.39, none of the pairwise comparisons is significant. This result is not identical to that
obtained with Equation 25.5, in which case a significant difference is computed for the
comparison Condition 1 versus Condition 3. Although the latter comparison (as well as the
Condition 1 versus Condition 2 comparison) comes close when it is evaluated with the Wilcoxon
test, it falls just short of achieving significance.
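
The z values quoted above can be checked with the following sketch. It assumes the usual form of the normal approximation for the Wilcoxon T statistic (expected value n(n + 1)/4, standard deviation equal to the square root of n(n + 1)(2n + 1)/24, and a continuity correction of .5), which is taken here to be what Equations 18.2 and 18.3 express. The function name `wilcoxon_z` is hypothetical.

```python
from math import sqrt

def wilcoxon_z(T, n, continuity=False):
    """Absolute z for the normal approximation of the Wilcoxon T statistic.

    Hypothetical helper; n is the number of nonzero difference scores.
    """
    expected_T = n * (n + 1) / 4
    sd_T = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    deviation = abs(T - expected_T)
    if continuity:
        deviation -= 0.5  # correction for continuity
    return deviation / sd_T

# Conditions 1 vs 2 and 1 vs 3: T = 0 with n = 6; Conditions 2 vs 3: T = 0 with n = 5
z6 = (wilcoxon_z(0, 6), wilcoxon_z(0, 6, continuity=True))
z5 = (wilcoxon_z(0, 5), wilcoxon_z(0, 5, continuity=True))
print([round(z, 2) for z in z6], [round(z, 2) for z in z5])

# None of the values reaches the adjusted two-tailed critical value 2.39
print(any(z >= 2.39 for z in z6 + z5))
```

The rounded results are 2.20 and 2.10 for n = 6 and 2.02 and 1.89 for n = 5, in agreement with the values reported in the text.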

In the event the researcher elects not to control the value of α_FW and employs α_PC = .05
in evaluating the three pairwise comparisons (once again assuming a nondirectional analysis),
both the Condition 1 versus Condition 2 and Condition 1 versus Condition 3 comparisons are
significant at the .05 level if the Wilcoxon test is employed. Specifically, both the uncorrected
and corrected normal approximations are significant, since z = 2.20 and z = 2.10 (computed for
both the Condition 1 versus Condition 2 and Condition 1 versus Condition 3 comparisons) are
greater than the tabled critical two-tailed value z_.05 = 1.96. The Condition 2 versus Condition
3 comparison is also significant, but only if the uncorrected value z = 2.02 is employed.
Employing Table A5, we also determine that both the Condition 1 versus Condition 2 and
Condition 1 versus Condition 3 comparisons are significant at the .05 level, since the computed
value T = 0 for both comparisons is equal to the tabled critical two-tailed .05 value T_.05 = 0 (for
n = 6). The Condition 2 versus Condition 3 comparison is not significant, since no two-tailed .05
critical value is listed in Table A5 for n = 5. If Equation 25.5 is employed for the same set of
comparisons, however, only the Condition 1 versus Condition 3 comparison is significant. This
is the case, since CD_F = (1.96)(3.46) = 6.78, and only the difference |ΣR_1 - ΣR_3| = 11.5 is
greater than CD_F = 6.78.¹² The difference |ΣR_1 - ΣR_2| = 6.5 (which is significant with the
Wilcoxon test) just falls short of achieving significance. Although the result obtained with
Equation 25.5 is not identical to that obtained with the Wilcoxon test, the two analyses are
reasonably consistent with one another.

In the event the binomial sign test for two dependent samples is employed to conduct
comparisons, a researcher must determine for each comparison the number of subjects who yield
positive versus negative difference scores. With the exception of Subject 6 in the Condition 2
versus Condition 3 comparison, all of the difference scores for the three pairwise comparisons
are positive (since all subjects obtain a higher score in Condition 1 than Condition 2, in
Condition 1 than Condition 3, and in Condition 2 than Condition 3). For the Condition 1 versus
Condition 2 and Condition 1 versus Condition 3 comparisons, we must compute P(x = 6) for
n = 6. For the Condition 2 versus Condition 3 comparison (which does not include Subject 6 in
the analysis, since the latter subject has a zero difference score), we must compute P(x = 5) for
n = 5. For all three pairwise comparisons π+ = π- = .5.

Employing Table A6 (Table of the Binomial Distribution, Individual Probabilities) (or
Table A7) in the Appendix we can determine that when n = 6, P(x = 6) = .0156. Thus, the
computed two-tailed probability for the Condition 1 versus Condition 2 and Condition 1 versus
Condition 3 comparisons is (2)(.0156) = .0312. When n = 5, P(x = 5) = .0312. The computed
two-tailed probability for the Condition 2 versus Condition 3 comparison is (2)(.0312) = .0624.
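
The binomial probabilities above can be computed directly rather than read from Table A6. The sketch below is illustrative only: the function name `sign_test_two_tailed` is hypothetical, and the two-tailed value is obtained by doubling the upper-tail probability, which is the procedure used in the text when π+ = π- = .5.

```python
from math import comb

def sign_test_two_tailed(x, n, pi=0.5):
    """Two-tailed binomial probability obtained by doubling P(X >= x).

    Hypothetical helper; doubling one tail is appropriate for the symmetric case pi = .5.
    """
    upper_tail = sum(
        comb(n, j) * pi**j * (1 - pi) ** (n - j) for j in range(x, n + 1)
    )
    return min(1.0, 2 * upper_tail)

p_n6 = sign_test_two_tailed(6, 6)  # Conditions 1 vs 2 and 1 vs 3
p_n5 = sign_test_two_tailed(5, 5)  # Conditions 2 vs 3 (Subject 6 excluded)
print(p_n6, p_n5)
```

The unrounded results are .03125 and .0625. The values .0312 and .0624 in the text arise from doubling the four-decimal tabled probabilities.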

As before, if we want to evaluate a nondirectional alternative hypothesis and insure that
α_FW does not exceed .05, the value of α_PC is set equal to .0167. Thus, in order to reject the
null hypothesis the computed two-tailed binomial probability for a comparison must be equal to
or less than .0167. Since the computed two-tailed probabilities .0312 (for the Condition 1 versus
Condition 2 and the Condition 1 versus Condition 3 comparisons) and .0624 (for the Condition
2 versus Condition 3 comparison) are greater than .0167, none of the pairwise comparisons is
significant.

In the event the researcher elects not to control the value of α_FW and employs α_PC = .05
for evaluating the three pairwise comparisons (once again assuming a nondirectional analysis),
both the Condition 1 versus Condition 2 and Condition 1 versus Condition 3 comparisons are
significant at the .05 level, since the computed two-tailed probability .0312 (for the Condition
1 versus Condition 2 and the Condition 1 versus Condition 3 comparisons) is less than .05. The
Condition 2 versus Condition 3 comparison is not significant, since the computed two-tailed
probability .0624 for the latter comparison is greater than .05.

When the results obtained with the binomial sign test are compared with those obtained 
with Equation 25.5 and the Wilcoxon test, it would appear that of the three procedures the 
binomial sign test results in the most conservative test (and thus, as noted previously, the least 
powerful test). However, if one takes into account the obtained binomial probabilities, they are, 
in actuality, not far removed from the probabilities obtained when Equation 25.5 and the 
Wilcoxon test are used. 

In the case of Example 25.1, regardless of which comparison procedure one employs, it
would appear that unless one uses a very low value for α_FW, the Condition 1 versus Condition
3 comparison is significant. There is some suggestion that the Condition 1 versus Condition 2
comparison may also be significant, but some researchers would recommend conducting
additional studies in order to clarify whether or not the two conditions represent different populations.
Although based on the analyses that have been conducted the Condition 2 versus Condition 3
comparison does not appear to be significant, it is worth noting that if the researcher uses the
Wilcoxon test (specifically, the normal approximation not corrected for continuity) to evaluate
the directional alternative hypothesis H_1: θ_2 > θ_3 with α_PC = .05, the latter comparison also
yields a significant result. Thus, further studies might be in order to clarify the relationship
between the populations represented by Conditions 2 and 3.

The intent of presenting three comparison procedures in this section is to illustrate that,
generally speaking, the results obtained with the different comparison procedures will be
reasonably consistent with one another. As is noted in the discussion of comparisons in Section
VI of both the single-factor between-subjects analysis of variance and the Kruskal-Wallis
one-way analysis of variance by ranks (see Endnote 7 of the latter test), in instances where two
or more comparison procedures yield inconsistent results, the most effective way to resolve such
a problem is by conducting one or more replication studies. By doing the latter, a researcher
should be able to clarify whether or not an obtained difference is reliable, as well as the
magnitude of the difference (if, in fact, one exists).

It is also noted throughout the book that, in the final analysis, the decision regarding which
of the available comparison procedures to employ is usually not the most important issue facing
the researcher conducting comparisons. The main issue is what maximum value one is willing
to tolerate for α_FW. Additional sources on comparison procedures for the Friedman test
model are Church and Wike (1979) (who provide a comparative analysis of a number of different
comparison procedures); Conover (1980, 1999) (who employs an equation that computes a
CD_F value that is different from the one obtained with Equation 25.5); Daniel (1990) (who
describes a methodology for comparing (k - 1) conditions with a control group, as well as a
methodology for estimating the size of a difference between the medians of any pair of
experimental conditions); Marascuilo and McSweeney (1977) (who within the framework of a
comprehensive discussion of the Friedman test model describe a methodology for conducting
complex comparisons); and Siegel and Castellan (1988) (who also describe the methodology for
comparing (k - 1) conditions with a control group).

Marascuilo and McSweeney (1977) also discuss the computation of a confidence interval
for a comparison for the Friedman test model. One approach for computing a confidence
interval is to add the computed value of CD_F to, and subtract it from, the obtained difference
between the sums of ranks (or mean ranks, if the equation in Endnote 7 is employed) involved
in the comparison. The latter approach is based on the same logic employed for computing a
confidence interval for a comparison in Section VI of the single-factor between-subjects
analysis of variance.
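
As a brief illustration of the add-and-subtract approach, the sketch below applies it to the Condition 1 versus Condition 3 comparison, using the sums of ranks 18 and 6.5 and the value CD_F = 8.28 obtained earlier in this section. The function name `comparison_interval` is hypothetical, and the confidence level attached to the interval is whatever level is implied by the z_adj value used to compute CD_F.

```python
def comparison_interval(sum_a, sum_b, cd):
    """Return (lower, upper) limits: observed difference minus and plus cd.

    Hypothetical helper illustrating the approach described in the text.
    """
    difference = sum_a - sum_b
    return difference - cd, difference + cd

# Condition 1 versus Condition 3 in Example 25.1
lower, upper = comparison_interval(18.0, 6.5, 8.28)
print(round(lower, 2), round(upper, 2))
```

Since the resulting limits (3.22 and 19.78) do not include zero, the interval is consistent with the significant result obtained for this comparison with Equation 25.5.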


VII. Additional Discussion of the Friedman Two-Way Analysis of 
Variance by Ranks 


1. Exact tables of the Friedman distribution Although an exact probability value can be 
computed for obtaining a configuration of ranks that is equivalent to or more extreme than the 
configuration observed in the data evaluated with the Friedman two-way analysis of variance 
by ranks, the chi-square distribution is generally employed to estimate the latter probability. 
Although most sources employ the chi-square approximation regardless of the values of k and 
n, some sources recommend that exact tables be employed when the values of n and/or k are 
small. The exact sampling distribution for the Friedman two-way analysis of variance by 
ranks is based on the use of Fisher's method of randomization (which is discussed in Section 
IX (the Addendum) of the Mann-Whitney U test). 

Tables of exact critical values, which can be viewed as adjusted chi-square values, can be
found in Marascuilo and McSweeney (1977) and Siegel and Castellan (1988) (who list critical
values for various values of n between 5 and 13 when the value of k is between 3 and 5).
Depending upon the values of k and n, exact critical values may be either slightly larger or smaller
than the critical chi-square values in Table A4. In point of fact, for k = 3 and n = 6 the exact
tabled critical .05 and .01 values for the Friedman test statistic are respectively χ²_r.05 = 7.00
and χ²_r.01 = 9.00. Since the value χ²_r = 11.08 computed for Example 25.1 is greater than both
of the aforementioned critical values, the null hypothesis can still be rejected at both the .05 and
.01 levels. Although the conclusions with respect to Example 25.1 are the same regardless of
whether one employs the exact critical values or the chi-square values in Table A4, inspection
of the two sets of values indicates that the exact .05 critical value is larger than the corresponding
critical value χ²_.05 = 5.99 derived from Table A4, while the reverse is true with respect to the
exact .01 critical value, which is less than the corresponding value χ²_.01 = 9.21 in Table A4. It
should be noted that for a given value of k, as the value of n increases, the exact critical value
approaches the tabled chi-square value in Table A4. An additional point of interest relevant
to evaluating the Friedman test statistic is that Daniel (1990) and Conover (1980, 1999) cite
a study by Iman and Davenport (1980) which suggests that the F distribution can be used to
approximate the sampling distribution for the Friedman test, and that the latter approximation
may be more accurate than the more commonly employed chi-square approximation.
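
To make the idea of an exact sampling distribution concrete, the sketch below enumerates every possible configuration of ranks for k = 3 and n = 6 and computes the exact probability of obtaining a Friedman statistic equal to or greater than a given value. It is a minimal illustration under stated assumptions: within each subject all k! orderings of the ranks are treated as equally likely under the null hypothesis, no tied ranks are allowed, and the statistic is the one computed with Equation 25.1. The function name `friedman_exact_tail` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product

def friedman_exact_tail(n, k, threshold):
    """Exact P(statistic >= threshold) by full enumeration (no ties assumed).

    Hypothetical helper; feasible only for small n and k, since (k!)**n
    configurations are visited.
    """
    orderings = list(permutations(range(1, k + 1)))
    hits = 0
    total = 0
    for config in product(orderings, repeat=n):
        rank_sums = [sum(column) for column in zip(*config)]
        # Equation 25.1, evaluated with exact rational arithmetic
        stat = Fraction(12 * sum(s * s for s in rank_sums), n * k * (k + 1)) - 3 * n * (k + 1)
        total += 1
        if stat >= threshold:
            hits += 1
    return Fraction(hits, total)

p_at_7 = friedman_exact_tail(6, 3, 7)
p_at_9 = friedman_exact_tail(6, 3, 9)
print(float(p_at_7), float(p_at_9))
```

Exact fractions are used so that the comparison with the threshold is not affected by floating-point rounding. The two probabilities fall at or below .05 and .01 respectively, which is what qualifies 7.00 and 9.00 as the exact critical values cited above.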


2. Equivalency of the Friedman two-way analysis of variance by ranks and the binomial
sign test for two dependent samples when k = 2 In Section I it is noted that when k = 2 the
Friedman two-way analysis of variance by ranks will yield a result that is equivalent to that
obtained with the binomial sign test for two dependent samples. To be more specific, the
Friedman test will yield a result that is equivalent to the normal approximation of the binomial
sign test for two dependent samples when the correction for continuity is not employed (i.e.,
the result obtained with Equation 19.3).¹³ It should be noted, however, that the two tests will
only yield an equivalent result when none of the subjects has the same score in the two
experimental conditions. In the case of the binomial sign test, any subject who has the same
score in both conditions is eliminated from the data analysis. In the case of the Friedman test,
however, such subjects are included in the analysis. In order to demonstrate the equivalency of
the two tests, Equation 25.1 will be employed to analyze the data for Example 19.1 (which was
previously evaluated with the binomial sign test for two dependent samples). In using
Equation 25.1 the data for Subject 2 are not included, since the latter subject has identical scores
in both conditions. Thus, in our analysis n = 9 and k = 2. Table 25.3 summarizes the
rank-ordering of data for Example 19.1 within the framework of the Friedman test model.


Table 25.3 Summary of Data for Example 19.1 for Friedman Test Model


                Condition 1          Condition 2
                X_1     R_1          X_2     R_2
Subject 1        9      2             8      1
Subject 2        2      1.5           2      1.5
Subject 3        1      1             3      2
Subject 4        4      2             2      1
Subject 5        6      2             3      1
Subject 6        4      2             0      1
Subject 7        7      2             4      1
Subject 8        8      2             5      1
Subject 9        5      2             4      1
Subject 10       1      2             0      1
                ΣR_1 = 18.5          ΣR_2 = 11.5


Since the data for Subject 2 are not included in the analysis, the rank-orders for Subject 2
under the two conditions are subtracted from the values ΣR_1 = 18.5 and ΣR_2 = 11.5, yielding
the revised values ΣR_1 = 17 and ΣR_2 = 10 which are employed in Equation 25.1. Employing
the latter equation, the value χ²_r = 5.44 is computed.

$$\chi_r^2 = \frac{12}{(9)(2)(3)}\left[(17)^2 + (10)^2\right] - (3)(9)(3) = 5.44$$

Employing Equation 25.2, df = 2 - 1 = 1. For df = 1, the tabled critical .05 and .01
chi-square values are χ²_.05 = 3.84 and χ²_.01 = 6.63. Since the obtained value χ²_r = 5.44 is greater
than χ²_.05 = 3.84, the alternative hypothesis is supported at the .05 level. It is not, however,
supported at the .01 level, since χ²_r = 5.44 is less than χ²_.01 = 6.63.¹⁴

Equation 19.3 yields the value z = 2.33 for the same set of data. Since the square of a z
value will equal the corresponding chi-square value computed for the same set of data, z² should
equal χ²_r. In point of fact, (z = 2.33)² = (χ²_r = 5.44) (the minimal discrepancy is due to
rounding off error). It is also the case that the square of the tabled critical z value employed for
the normal approximation of the binomial sign test for two dependent samples will always
equal the tabled critical chi-square value employed for the Friedman test at the same level of
significance. Thus, the square of the tabled critical two-tailed value z_.05 = 1.96 employed for
the normal approximation of the binomial sign test equals χ²_.05 = 3.84 employed for the
Friedman test (i.e., (z = 1.96)² = (χ² = 3.84)).
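
The equivalence can be verified numerically with the sketch below. It assumes that Equation 19.3 has the usual form of the uncorrected normal approximation, z = (x - nπ)/√(nπ(1 - π)), and it takes x = 8 from Table 25.3 (eight of the nine retained subjects score higher in Condition 1). The function names are hypothetical.

```python
from math import sqrt

def friedman_chi2(rank_sums, n):
    """Friedman statistic computed from the sums of ranks (Equation 25.1)."""
    k = len(rank_sums)
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)

def sign_test_z(x, n, pi=0.5):
    """Normal approximation of the binomial sign test, no continuity correction."""
    return (x - n * pi) / sqrt(n * pi * (1 - pi))

chi2_r = friedman_chi2([17, 10], n=9)  # Subject 2 excluded
z = sign_test_z(8, 9)                  # 8 positive difference scores out of 9
print(round(chi2_r, 2), round(z, 2), round(z * z, 2))
```

Before rounding, z² and the Friedman statistic agree (both are approximately 5.444), which confirms that the small discrepancy noted above is due entirely to squaring the rounded value 2.33.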


3. Power-efficiency of the Friedman two-way analysis of variance by ranks Daniel (1989)
notes that Noether (1967) states that when the underlying population distributions are normal, the
asymptotic relative efficiency (which is discussed in Section VII of the Wilcoxon signed-ranks
test (Test 6)) of the Friedman two-way analysis of variance by ranks (relative to the
single-factor within-subjects analysis of variance) is .955k/(k + 1). Thus, when k = 2 the asymptotic
relative efficiency of the Friedman test is .64, but when k = 10 it equals .87. For a uniform
distribution, the asymptotic relative efficiency of the Friedman test is k/(k + 1).
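
The two efficiency expressions are easy to tabulate for several values of k. The sketch below simply evaluates the formulas quoted above; the function names are hypothetical.

```python
def are_normal(k):
    """Asymptotic relative efficiency under normality: .955k / (k + 1)."""
    return 0.955 * k / (k + 1)

def are_uniform(k):
    """Asymptotic relative efficiency under a uniform distribution: k / (k + 1)."""
    return k / (k + 1)

for k in (2, 3, 5, 10):
    print(k, round(are_normal(k), 2), round(are_uniform(k), 2))
```

For k = 2 and k = 10 the first expression yields .64 and .87, the two values cited in the text. Both expressions increase as k increases.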


4. Alternative nonparametric rank-order procedures for evaluating a design involving k
dependent samples In addition to the Friedman two-way analysis of variance by ranks, a
number of other nonparametric procedures for two or more dependent samples have been
developed that can be employed with ordinal data. Among the more commonly cited alternative
procedures are the following: a) Marascuilo and McSweeney (1977) describe the extension of
the van der Waerden normal-scores test for k independent samples (Test 23) (Van der
Waerden (1952/1953)) to a design involving k dependent samples. Conover (1980, 1999) notes
that the normal-scores test developed by Bell and Doksum (1965) can also be extended to the
latter design; b) Page's test for ordered alternatives (Page (1963)) can be employed with k
dependent samples to evaluate an ordered alternative hypothesis. Specifically, in stating the
alternative hypothesis, the ordinal position of the treatment effects is stipulated (as opposed to
just stating that a difference exists between at least two of the k experimental conditions). Page's
test for ordered alternatives is described in Daniel (1990), Marascuilo and McSweeney (1977),
and Siegel and Castellan (1988); and c) Additional tests that can be employed with a k dependent
samples design are either discussed or referenced in Conover (1980, 1999), Daniel (1990),
Hollander and Wolfe (1999), Marascuilo and McSweeney (1977), and Sheskin (1984).


5. Relationship between the Friedman two-way analysis of variance by ranks and
Kendall’s coefficient of concordance The Friedman two-way analysis of variance by ranks
and Kendall's coefficient of concordance (Test 31) (which is one of a number of measures of
association that are described in this book) are based on the same statistical model. The latter
measure of association is employed with three or more sets of ranks when rankings are based on
the Friedman test protocol. A full discussion of the relationship between the Friedman
two-way analysis of variance by ranks and Kendall’s coefficient of concordance (which can be
used as a measure of effect size for the Friedman test) can be found in Section VII of Kendall’s
coefficient of concordance. When there are n = 2 subjects/sets of matched subjects,
Spearman's rank-order correlation coefficient (Test 29), which is linearly related to Kendall’s
coefficient of concordance, can be conceptualized within the framework of the Friedman test
model. The latter relationship is discussed in Section VII of Spearman's rank-order
correlation coefficient.


VIII. Additional Examples Illustrating the Use of the Friedman 
Two-Way Analysis of Variance by Ranks 


The Friedman two-way analysis of variance by ranks can be employed to evaluate any of the
additional examples noted for the single-factor within-subjects analysis of variance, if the data
for the latter examples are rank-ordered. In addition, the Friedman test can be used to evaluate
the data for any of the additional examples noted for the t test for two dependent samples/
binomial sign test for two dependent samples/Wilcoxon matched-pairs signed-ranks test.
Example 25.2 is an additional example that can be evaluated with the Friedman two-way
analysis of variance by ranks. In Example 25.2 there is no need to rank-order interval/ratio
data, since the results of the study are summarized in a rank-order format.¹⁵


Example 25.2 Six horses are rank-ordered by a trainer with respect to their racing form on 
three different surfaces. Specifically, Track A has a cement surface, Track B a clay surface, and 
Track C a grass surface. Except for the surface, the three tracks are comparable to one another 
in all other respects. Table 25.4 summarizes the rankings of the horses on the three tracks. (In 
the case of Horse 6, the rank of 1.5 for both the clay and grass tracks reflects the fact that the 
horse was perceived to have equal form on both surfaces.) Do the data indicate that the form 
of a horse is related to the surface on which it is racing? 


Table 25.4 Data for Example 25.2


             Track A      Track B      Track C
             (Cement)     (Clay)       (Grass)
Horse 1         3            2            1
Horse 2         3            2            1
Horse 3         3            2            1
Horse 4         3            2            1
Horse 5         3            2            1
Horse 6         3            1.5          1.5

Since the ranks employed in Example 25.2 are identical to those employed for Example
25.1, the Friedman test will yield the identical result. Most people would probably be
inclined to employ a rank of 1 to represent a horse’s best surface and a rank of 3 to represent a
horse’s worst surface. Under such a ranking protocol, the track with the lowest sum of ranks
(Track C) is associated with the best racing form and the track with the highest sum of ranks
(Track A) is associated with the worst racing form.
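
For readers who wish to confirm the result, the sketch below computes the sums of ranks from Table 25.4 and then applies Equation 25.1. It is illustrative only: the variable names are hypothetical, and no correction for the tie in Horse 6's rankings is applied, in keeping with the uncorrected value 11.08 reported for Example 25.1.

```python
# Ranks from Table 25.4, one list per track (six horses each)
ranks = {
    "A": [3, 3, 3, 3, 3, 3],
    "B": [2, 2, 2, 2, 2, 1.5],
    "C": [1, 1, 1, 1, 1, 1.5],
}
n = 6            # number of horses
k = len(ranks)   # number of tracks

rank_sums = {track: sum(values) for track, values in ranks.items()}

# Equation 25.1 (uncorrected for ties)
chi2_r = 12 / (n * k * (k + 1)) * sum(s ** 2 for s in rank_sums.values()) - 3 * n * (k + 1)

# With 1 = best surface, the lowest sum of ranks marks the best racing form
best_track = min(rank_sums, key=rank_sums.get)
print(rank_sums, round(chi2_r, 2), best_track)
```

The sums of ranks are 18, 11.5, and 6.5, and the statistic rounds to 11.08, the same value obtained for Example 25.1.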
Endnotes


1. The reader should take note of the fact that when there are k = 2 dependent samples, the 
Wilcoxon matched-pairs signed-ranks test (which is also described in this book as a 
nonparametric test for evaluating ordinal data) will not yield a result equivalent to that 
obtained with the Friedman two-way analysis of variance by ranks. Since the Wilcoxon 
test (which rank-orders interval/ratio difference scores) employs more information than the 
Friedman test/binomial sign test, it provides a more powerful test of an alternative hy- 
pothesis than the latter tests. 


2. A more detailed discussion of the guidelines noted below can be found in Sections I and
VII of the t test for two dependent samples.


3. Although it is possible to conduct a directional analysis, such an analysis will not be
described with respect to the Friedman two-way analysis of variance by ranks. A
discussion of a directional analysis when k = 2 can be found under the binomial sign test
for two dependent samples. A discussion of the evaluation of a directional alternative
hypothesis when k ≥ 3 can be found in Section VII of the chi-square goodness-of-fit test
(Test 8). Although the latter discussion is in reference to analysis of a k independent
samples design involving categorical data, the general principles regarding analysis of a
directional alternative hypothesis when k ≥ 3 are applicable to the Friedman two-way
analysis of variance by ranks.


4. Note that this ranking protocol differs from that employed for other rank-order procedures
discussed in the book. In other rank-order tests, the rank assigned to each score is based
on the rank-order of the score within the overall distribution of nk = N scores.


5. As noted in Section IV, the chi-square distribution provides an approximation of the
Friedman test statistic. Although the chi-square distribution provides an excellent
approximation of the Friedman sampling distribution, some sources recommend the use of exact
probabilities for small sample sizes. Exact tables of the Friedman distribution are discussed
in Section VII.


6. In the discussion of comparisons in reference to the analysis of variance, it is noted that a
simple (also known as a pairwise) comparison is a comparison between any two groups/
conditions in a set of k groups/conditions.


7. An alternative form of the comparison equation, which identifies the minimum required
difference between the means of the ranks of any two conditions in order for them to
differ from one another at the prespecified level of significance, is noted below.

$$CD_{F(\bar{R}_a - \bar{R}_b)} = z_{adj}\sqrt{\frac{k(k+1)}{6n}}$$

If the CD_F value computed with Equation 25.5 is divided by n, it yields the value
computed with the above equation.


8. The method for deriving the value of z_adj for the Friedman two-way analysis of
variance by ranks is based on the same logic that is employed in Equation 22.5 (which is used
for conducting comparisons for the Kruskal-Wallis one-way analysis of variance by
ranks). A rationale for the use of the proportions .0167 and .0083 in determining the
appropriate value for z_adj in Example 25.1 can be found in Endnote 5 of the
Kruskal-Wallis one-way analysis of variance by ranks.


9. It should be noted that when a directional alternative hypothesis is employed, the sign of
the difference between the two sums of ranks must be consistent with the prediction stated
in the directional alternative hypothesis. When a nondirectional alternative hypothesis is
employed, the direction of the difference between two sums of ranks is irrelevant.


10. In the case of both the Wilcoxon matched-pairs signed-ranks test and the binomial sign
test for two dependent samples, it is assumed that for each pairwise comparison a subject's
score in the second condition that is listed for a comparison is subtracted from the subject's
score in the first condition that is listed for the comparison. In the case of both tests,
reversing the order of subtraction will yield the same result.


11. The value n = 6 is employed for the Condition 1 versus Condition 2 and Condition 1 versus
Condition 3 comparisons, since no subject has the same score in both experimental
conditions. On the other hand, the value n = 5 is employed in the Condition 2 versus
Condition 3 comparison, since Subject 6 has the same score in Conditions 2 and 3. The use
of n = 5 is predicated on the fact that in conducting the Wilcoxon matched-pairs
signed-ranks test, subjects who have a difference score of zero are not included in the computation
of the test statistic.


12. In Equation 25.5 the value z_.05 = 1.96 is employed for z_adj, and the latter value is
multiplied by 3.46, which is the value computed for the term in the radical of the equation for
Example 25.1.


13. It is also the case that the exact binomial probability for the binomial sign test for two
dependent samples will correspond to the exact probability for the Friedman test
statistic.


14. If Subject 2 is included in the analysis, Equation 25.1 yields the value χ²_r = 4.9 which is
also significant at the .05 level.


15. In Section I it is noted that in employing the Friedman test it is assumed that the variable
which is ranked is a continuous random variable. Thus, it would be assumed that the racing
form of a horse was at some point either explicitly or implicitly expressed as a continuous
interval/ratio variable.


Test 26
The Cochran Q Test


(Nonparametric Test Employed with Categorical/Nominal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Hypothesis evaluated with test In a set of k dependent samples (where k ≥ 2), do at least two
of the samples represent different populations?


Relevant background information on test It is recommended that before reading the material
on the Cochran Q test, the reader review the general information on a dependent samples design
contained in Sections I and VII of the t test for two dependent samples (Test 17). The
Cochran Q test (Cochran (1950)) is a nonparametric procedure for categorical data employed
in a hypothesis testing situation involving a design with k = 2 or more dependent samples. The
test is employed to evaluate an experiment in which a sample of n subjects (or n sets of matched
subjects) is evaluated on a dichotomous dependent variable (i.e., scores on the dependent variable
must fall within one of two mutually exclusive categories). The test assumes that each of the n
subjects (or n sets of matched subjects) contributes k scores on the dependent variable. The
Cochran Q test is an extension of the McNemar test (Test 20) to a design involving more than
two dependent samples, and when k = 2 the Cochran Q test will yield a result that is equivalent
to that obtained with the McNemar test. If the result of the Cochran Q test is significant, it
indicates there is a high likelihood at least two of the k experimental conditions represent
different populations.

The Cochran Q test is based on the following assumptions: a) The sample of n subjects 
has been randomly selected from the population it represents; and b) The scores of subjects are 
in the form of a dichotomous categorical measure involving two mutually exclusive categories. 

Although the chi-square distribution is generally employed to evaluate the Cochran test 
statistic, in actuality the latter distribution is used to provide an approximation of the exact sam- 
pling distribution. Sources on nonparametric analysis (e.g., Daniel (1990), Marascuilo and 
McSweeney (1977), and Siegel and Castellan (1988)) recommend that for small sample sizes 
exact tables of the Q distribution derived by Patil (1975) be employed. Use of exact tables is
generally recommended when n < 4 and/or nk < 24.

As is the case for other tests that are employed to evaluate data involving two or more 
dependent samples, in order for the Cochran Q test to generate valid results the following 
guidelines should be adhered to:¹ a) To control for order effects, the presentation of the k
experimental conditions should be random or, if appropriate, be counterbalanced; and b) If matched
samples are employed, within each set of matched subjects each of the subjects should be 
randomly assigned to one of the k experimental conditions. 

As is noted with respect to other tests that are employed to evaluate a design involving two 
or more dependent samples, the Cochran Q test can also be used to evaluate a before-after
design, as well as extensions of the latter design that involve more than two measurement periods.
The limitations of the before-after design (which are discussed in Section VII of the t test for
two dependent samples) are also applicable when it is evaluated with the Cochran Q test.


II. Example


Example 26.1 A market researcher asks 12 female subjects whether or not they would purchase 
an automobile manufactured by three different companies. Specifically, subjects are asked 
whether they would purchase a car manufactured by the following automobile manufacturers: 
Chenesco, Howasaki, and Gemini. The responses of the 12 subjects follow: Subject 1 said she 
would purchase a Chenesco and a Howasaki but not a Gemini; Subject 2 said she would only 
purchase a Howasaki; Subject 3 said she would purchase all three makes of cars; Subject 4 
said she would only purchase a Howasaki; Subject 5 said she would only purchase a Howasaki; 
Subject 6 said she would purchase a Howasaki and a Gemini but not a Chenesco; Subject 7 
said she would not purchase any of the automobiles; Subject 8 said she would only purchase a 
Howasaki; Subject 9 said she would purchase a Chenesco and a Howasaki but not a Gemini; 
Subject 10 said she would only purchase a Howasaki; Subject 11 said she would not purchase 
any of the automobiles; and Subject 12 said she would only purchase a Gemini. Can the market 
researcher conclude that there are differences with respect to car preference based on the 
responses of subjects? 


III. Null versus Alternative Hypotheses 


In stating the null and alternative hypotheses the notation $\pi_j$ will be employed to represent the
proportion of Yes responses in the population represented by the $j^{th}$ experimental condition.
Stated more generally, $\pi_j$ represents the proportion of responses in one of the two response
categories in the population represented by the $j^{th}$ experimental condition.

Null hypothesis $H_0$: $\pi_1 = \pi_2 = \pi_3$


(The proportion of Yes responses in the population represented by Condition 1 equals the 
proportion of Yes responses in the population represented by Condition 2 equals the proportion 
of Yes responses in the population represented by Condition 3.) 


Alternative hypothesis $H_1$: Not $H_0$


(This indicates that in at least two of the underlying populations represented by the k = 3
conditions, the proportions of Yes responses are not equal. It is important to note that the
alternative hypothesis should not be written as follows: $H_1$: $\pi_1 \neq \pi_2 \neq \pi_3$. The
latter notation for the alternative hypothesis is incorrect because it implies that all three
population proportions must differ from one another in order to reject the null hypothesis. In this
book it will be assumed (unless stated otherwise) that the alternative hypothesis for the Cochran 
Q test is stated nondirectionally.) 


IV. Test Computations 


The data for Example 26.1 are summarized in Table 26.1. The number of subjects employed in 
the experiment is n = 12, and thus within each condition there are $n = n_1 = n_2 = n_3 = 12$
scores. The values 1 and 0 are employed to represent the two response categories in which a 
subject’s response/categorization may fall. Specifically, a score of 1 indicates a Yes response 
and a score of 0 indicates a No response. 

The following summary values are computed in Table 26.1 which will be employed in the 
analysis of the data: 

a) The value $\sum C_j$ represents the number of Yes responses in the $j^{th}$ condition. Thus, the
number of Yes responses in Conditions 1, 2, and 3 are, respectively, $\sum C_1 = 3$, $\sum C_2 = 9$, and
$\sum C_3 = 3$.

b) The value $(\sum C_j)^2$ represents the square of the $\sum C_j$ value computed for the $j^{th}$ condition.
The sum of the k = 3 $(\sum C_j)^2$ scores can be represented by the notation $\sum(\sum C_j)^2$. Thus, for
Example 26.1, $\sum(\sum C_j)^2 = (\sum C_1)^2 + (\sum C_2)^2 + (\sum C_3)^2 = 9 + 81 + 9 = 99$.

c) The value $R_i$ represents the sum of the k = 3 scores of the $i^{th}$ subject (i.e., the number
of Yes responses for the $i^{th}$ subject). Note that an $R_i$ value is computed for each of the n = 12
subjects. The sum of the n $R_i$ scores is $\sum R_i$. Thus, for Example 26.1,
$\sum R_i = 2 + 1 + \cdots + 0 + 1 = 15$.

d) The value $R_i^2$ represents the square of the $R_i$ score of the $i^{th}$ subject. The sum of the n
$R_i^2$ scores is $\sum R_i^2$. Thus, for Example 26.1, $\sum R_i^2 = 4 + 1 + \cdots + 0 + 1 = 27$.

e) The value $p_j$ represents the proportion of Yes responses in the $j^{th}$ condition. The value
of $p_j$ is computed as follows: $p_j = \sum C_j / n$. Thus, in Table 26.1 the values of $p_j$ for Conditions
1, 2, and 3 are, respectively, $p_1 = .25$, $p_2 = .75$, and $p_3 = .25$.


Table 26.1 Data for Example 26.1


              Chenesco   Howasaki   Gemini
                C_1        C_2       C_3      R_i    R_i^2
Subject 1        1          1         0        2       4
Subject 2        0          1         0        1       1
Subject 3        1          1         1        3       9
Subject 4        0          1         0        1       1
Subject 5        0          1         0        1       1
Subject 6        0          1         1        2       4
Subject 7        0          0         0        0       0
Subject 8        0          1         0        1       1
Subject 9        1          1         0        2       4
Subject 10       0          1         0        1       1
Subject 11       0          0         0        0       0
Subject 12       0          0         1        1       1

$\sum C_1 = 3$   $\sum C_2 = 9$   $\sum C_3 = 3$   $\sum R_i = 15$   $\sum R_i^2 = 27$

$(\sum C_1)^2 = (3)^2 = 9$   $(\sum C_2)^2 = (9)^2 = 81$   $(\sum C_3)^2 = (3)^2 = 9$

$p_1 = \sum C_1 / n_1 = 3/12 = .25$   $p_2 = \sum C_2 / n_2 = 9/12 = .75$   $p_3 = \sum C_3 / n_3 = 3/12 = .25$
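The summary values in Table 26.1 can be reproduced with a few lines of Python. The following is a minimal sketch, not part of the original procedure: the variable names (`data`, `col_sums`, `row_sums`, and so on) are illustrative assumptions, and the rows simply transcribe the 1/0 scores of the 12 subjects.

```python
# Responses from Table 26.1: one row per subject, columns are
# (Chenesco, Howasaki, Gemini); 1 = Yes, 0 = No.
data = [
    (1, 1, 0), (0, 1, 0), (1, 1, 1), (0, 1, 0),
    (0, 1, 0), (0, 1, 1), (0, 0, 0), (0, 1, 0),
    (1, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 1),
]

n = len(data)        # number of subjects
k = len(data[0])     # number of conditions

col_sums = [sum(row[j] for row in data) for j in range(k)]  # sum of C_j per condition
row_sums = [sum(row) for row in data]                       # R_i per subject

C = sum(c ** 2 for c in col_sums)   # sum of the squared column sums
T = sum(row_sums)                   # sum of the R_i scores
R = sum(r ** 2 for r in row_sums)   # sum of the squared R_i scores
p = [c / n for c in col_sums]       # proportion of Yes responses per condition

print(col_sums, C, T, R, p)
```

The three quantities C, T, and R are the only values that the test statistic requires.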


Equation 26.1 is employed to calculate the test statistic for the Cochran Q test. The Q
value computed with Equation 26.1 is interpreted as a chi-square value. In Equation 26.1 the
following notation is employed with respect to the summary values noted in this section: a) C
is employed to represent the value computed for $\sum(\sum C_j)^2$; b) T is employed to represent the
value computed for $\sum R_i$; and c) R is employed to represent the value computed for $\sum R_i^2$. Thus,
for Example 26.1, C = 99, T = 15, and R = 27.³


$$Q = \frac{(k - 1)[kC - T^2]}{kT - R}$$   (Equation 26.1)


Substituting the appropriate values from Example 26.1 in Equation 26.1, the value Q = 8
is computed.⁴


$$Q = \frac{(3 - 1)[(3)(99) - (15)^2]}{(3)(15) - 27} = 8$$
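As a check on the arithmetic, Equation 26.1 can be written as a short Python function. This is a sketch under the assumption that the data are arranged as in Table 26.1 (rows are subjects, columns are conditions); the function name `cochran_q` is hypothetical and does not come from any library.

```python
def cochran_q(data):
    """Cochran Q via Equation 26.1; rows are subjects, columns are conditions."""
    k = len(data[0])
    col_sums = [sum(row[j] for row in data) for j in range(k)]
    row_sums = [sum(row) for row in data]
    C = sum(c ** 2 for c in col_sums)   # sum of squared column sums
    T = sum(row_sums)                   # total number of Yes responses
    R = sum(r ** 2 for r in row_sums)   # sum of squared row sums
    # Note: kT - R is zero when every subject gives the same response in all
    # conditions, in which case Q is undefined.
    return (k - 1) * (k * C - T ** 2) / (k * T - R)


example_26_1 = [
    (1, 1, 0), (0, 1, 0), (1, 1, 1), (0, 1, 0),
    (0, 1, 0), (0, 1, 1), (0, 0, 0), (0, 1, 0),
    (1, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 1),
]
q = cochran_q(example_26_1)
print(q)
```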


V. Interpretation of the Test Results


In order to reject the null hypothesis, the computed value $Q = \chi^2$ must be equal to or greater
than the tabled critical chi-square value at the prespecified level of significance. The computed
chi-square value is evaluated with Table A4 (Table of the Chi-Square Distribution) in the
Appendix. For the appropriate degrees of freedom, the tabled $\chi^2_{.05}$ value (which is the chi-square
value at the 95th percentile) and the tabled $\chi^2_{.01}$ value (which is the chi-square value at the 99th
percentile) are employed as the .05 and .01 critical values for evaluating a nondirectional
alternative hypothesis. The number of degrees of freedom employed in the analysis is
computed with Equation 26.2. Thus, df = 3 - 1 = 2.


df = k - 1   (Equation 26.2)


For df = 2, the tabled critical .05 and .01 chi-square values are $\chi^2_{.05} = 5.99$ and $\chi^2_{.01} = 9.21$.
Since the computed value Q = 8 is greater than $\chi^2_{.05} = 5.99$, the alternative hypothesis is
supported at the .05 level. Since, however, Q = 8 is less than $\chi^2_{.01} = 9.21$, the alternative hypothesis
is not supported at the .01 level. A summary of the analysis of Example 26.1 with the Cochran
Q test follows: It can be concluded that there is a significant difference in subjects' preferences
for at least two of the three automobiles. This result can be summarized as follows: Q(2) = 8,
p < .05.
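For df = 2 the chi-square distribution has a simple closed form (its upper-tail probability is $e^{-x/2}$), so the tabled critical values and an approximate probability for Q = 8 can be checked without a table. The sketch below relies on that closed form and therefore applies only to df = 2; the function names are illustrative assumptions.

```python
import math


def chi2_upper_tail_df2(x):
    """Upper-tail probability of a chi-square variable with df = 2."""
    return math.exp(-x / 2)


def chi2_critical_df2(alpha):
    """Critical value that leaves a proportion alpha in the upper tail (df = 2)."""
    return -2 * math.log(alpha)


q = 8
crit_05 = chi2_critical_df2(0.05)
crit_01 = chi2_critical_df2(0.01)
p_value = chi2_upper_tail_df2(q)

print(round(crit_05, 2), round(crit_01, 2), round(p_value, 4))
print("significant at .05:", q >= crit_05, " at .01:", q >= crit_01)
```

The probability obtained this way is the chi-square approximation, not the exact probability of Q.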


VI. Additional Analytical Procedures for the Cochran Q Test 
and/or Related Tests 


1. Pairwise comparisons following computation of the test statistic for the Cochran Q test 
Prior to reading this section the reader should review the discussion of comparisons in Section 
VI of the single-factor between-subjects analysis of variance (Test 21). As is the case 
with the omnibus F value computed for an analysis of variance, the Q value computed with 
Equation 26.1 is based on an evaluation of all k experimental conditions. When the value of Q 
is significant, it does not indicate whether just two or, in fact, more than two conditions differ 
significantly from one another. In order to answer the latter question, it is necessary to conduct 
comparisons contrasting specific conditions with one another. This section will describe
methodologies that can be employed for conducting simple/pairwise comparisons following the
computation of a Q value.⁵

In conducting a simple comparison, the null hypothesis and nondirectional alternative
hypothesis are as follows: $H_0$: $\pi_a = \pi_b$ versus $H_1$: $\pi_a \neq \pi_b$. In the aforementioned hypotheses,
$\pi_a$ and $\pi_b$ represent the proportion of Yes responses in the populations represented by the two
conditions involved in the comparison. The alternative hypothesis can also be stated
directionally as follows: $H_1$: $\pi_a > \pi_b$ or $H_1$: $\pi_a < \pi_b$.

A number of sources (e.g., Fleiss (1981) and Marascuilo and McSweeney (1977)) describe 
comparison procedures for the Cochran Q test. The procedure to be described in this section, 
which is one of two procedures described in Marascuilo and McSweeney (1977), is essentially 
the application of the Bonferroni-Dunn method described in Section VI of the single-factor 
between-subjects analysis of variance to the Cochran Q test model. Through use of Equation 
26.3, the procedure allows a researcher to identify the minimum required difference between the 
observed proportion of Yes responses for any two experimental conditions (designated as $CD_C$)
in order for them to differ from one another at the prespecified level of significance.


$$CD_C = z_{adj}\sqrt{2\left[\frac{kT - R}{n^2 k(k - 1)}\right]}$$   (Equation 26.3)


The value of $z_{adj}$ is obtained from Table A1 (Table of the Normal Distribution) in
the Appendix. In the case of a nondirectional alternative hypothesis, $z_{adj}$ is the z value above
which a proportion of cases corresponding to the value $\alpha_{FW}/2c$ falls (where c is the total number
of comparisons that are conducted). In the case of a directional alternative hypothesis, $z_{adj}$ is the
z value above which a proportion of cases corresponding to the value $\alpha_{FW}/c$ falls. When all
possible pairwise comparisons are made c = [k(k - 1)]/2, and thus, 2c = k(k - 1). In Example
26.1 the number of pairwise/simple comparisons that can be conducted is c = [3(3 - 1)]/2 = 3:
specifically, Condition 1 versus Condition 2, Condition 1 versus Condition 3, and Condition
2 versus Condition 3.

The value of $z_{adj}$ will be a function of both the maximum familywise Type I error rate
($\alpha_{FW}$) the researcher is willing to tolerate and the total number of comparisons that are
conducted. When a limited number of comparisons are planned prior to collecting the data, most
sources take the position that a researcher is not obliged to control the value of $\alpha_{FW}$. In such a
case, the per comparison Type I error rate ($\alpha_{PC}$) will be equal to the prespecified value of
alpha. When $\alpha_{FW}$ is not adjusted, the value of $z_{adj}$ employed in Equation 26.3 will be the tabled
critical z value that corresponds to the prespecified level of significance. Thus, if a
nondirectional alternative hypothesis is employed and $\alpha = \alpha_{PC} = .05$, the tabled critical
two-tailed .05 value $z_{.05} = 1.96$ is used to represent $z_{adj}$ in Equation 26.3. If $\alpha = \alpha_{PC} = .01$, the
tabled critical two-tailed .01 value $z_{.01} = 2.58$ is used in Equation 26.3. In the same respect, if
a directional alternative hypothesis is employed, the tabled critical .05 and .01 one-tailed values
$z_{.05} = 1.65$ and $z_{.01} = 2.33$ are used for $z_{adj}$ in Equation 26.3.

When comparisons are not planned beforehand, it is generally acknowledged that the value
of $\alpha_{FW}$ must be controlled so as not to become excessive. The general approach for controlling
the latter value is to establish a per comparison Type I error rate which insures that $\alpha_{FW}$ will
not exceed some maximum value stipulated by the researcher. One method for doing this
(described under the single-factor between-subjects analysis of variance as the Bonferroni-Dunn
method) establishes the per comparison Type I error rate by dividing the maximum
value one will tolerate for the familywise Type I error rate by the total number of comparisons
conducted. Thus, in Example 26.1, if one intends to conduct all three pairwise comparisons
and wants to insure that $\alpha_{FW}$ does not exceed .05, $\alpha_{PC} = \alpha_{FW}/c = .05/3 = .0167$. The latter
proportion is used in determining the value of $z_{adj}$. As noted earlier, if a directional alternative
hypothesis is employed for a comparison, the value of $z_{adj}$ employed in Equation 26.3 is the
z value above which a proportion equal to $\alpha_{PC} = \alpha_{FW}/c$ of the cases falls. In Table A1, the
z value that corresponds to the proportion .0167 is z = 2.13. By employing $z_{adj} = 2.13$ in Equation
26.3, one can be assured that within the "family" of three pairwise comparisons, $\alpha_{FW}$ will not
exceed .05 (assuming all of the comparisons are directional). If a nondirectional alternative
hypothesis is employed for all of the comparisons, the value of $z_{adj}$ will be the z value
above which a proportion equal to $\alpha_{FW}/2c = \alpha_{PC}/2$ of the cases falls. Since $\alpha_{PC}/2 = .0167/2$
= .0083, z = 2.39. By employing $z_{adj} = 2.39$ in Equation 26.3, one can be assured that $\alpha_{FW}$ will not
exceed .05.⁶
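The adjusted z values quoted above can be recovered from the inverse of the standard normal distribution instead of Table A1. The following sketch uses `statistics.NormalDist` from the Python standard library; the helper name `z_adj` and its parameters are assumptions introduced for illustration.

```python
from statistics import NormalDist


def z_adj(alpha_fw, c, directional=False):
    """z value for a Bonferroni-Dunn adjusted comparison.

    alpha_fw    -- maximum familywise Type I error rate
    c           -- total number of comparisons conducted
    directional -- True for a one-tailed (directional) alternative hypothesis
    """
    # Proportion of cases in the upper tail: alpha_FW / c if directional,
    # alpha_FW / 2c if nondirectional.
    tail = alpha_fw / c if directional else alpha_fw / (2 * c)
    return NormalDist().inv_cdf(1 - tail)


z_directional = z_adj(0.05, 3, directional=True)
z_nondirectional = z_adj(0.05, 3)
z_unadjusted = z_adj(0.05, 1)   # a single comparison, so no adjustment

print(round(z_directional, 2), round(z_nondirectional, 2), round(z_unadjusted, 2))
```

Values computed this way can differ in the last decimal place from values read off a printed table, since the table entries are themselves rounded.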








Table 26.2 Difference Scores Between Pairs of Proportions for Example 26.1 


$|p_1 - p_2| = |.25 - .75| = .50$     $|p_1 - p_3| = |.25 - .25| = 0$     $|p_2 - p_3| = |.75 - .25| = .50$

In order to employ the $CD_C$ value computed with Equation 26.3, it is necessary to
determine the absolute value of the difference between the proportion of Yes responses for each
pair of experimental conditions that are compared.⁷ Table 26.2 summarizes the difference scores
between pairs of proportions.

If any of the differences between two proportions is equal to or greater than the $CD_C$ value
computed with Equation 26.3, a comparison is declared significant. Equation 26.3 will now be
employed to evaluate the nondirectional alternative hypothesis $H_1$: $\pi_a \neq \pi_b$ for all three
pairwise comparisons. Since it will be assumed that the comparisons are unplanned and that the
researcher does not want the value of $\alpha_{FW}$ to exceed .05, the value $z_{adj} = 2.39$ will be used in
computing $CD_C$.





$$CD_C = (2.39)\sqrt{2\left[\frac{(3)(15) - 27}{(12)^2(3)(3 - 1)}\right]} = (2.39)(.204) = .49$$








The obtained value $CD_C = .49$ indicates that any difference between a pair of proportions
that is equal to or greater than .49 is significant. With respect to the three pairwise comparisons,
the difference between Condition 1 and Condition 2 (which equals .50) and the difference
between Condition 2 and Condition 3 (which also equals .50) are significant, since they are both
greater than $CD_C = .49$. We cannot conclude that the difference between Condition 1 and
Condition 3 is significant, since $|p_1 - p_3| = 0$ is less than $CD_C = .49$.
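The comparison procedure based on Equation 26.3 can be sketched in Python as follows. The function name `critical_difference` is a hypothetical label, and the summary values (n, k, T, R, and the three proportions) are those reported for Example 26.1.

```python
import math
from itertools import combinations
from statistics import NormalDist


def critical_difference(z, n, k, T, R):
    """Equation 26.3: minimum required difference between two proportions."""
    return z * math.sqrt(2 * (k * T - R) / (n ** 2 * k * (k - 1)))


n, k, T, R = 12, 3, 15, 27      # summary values for Example 26.1
p = [0.25, 0.75, 0.25]          # proportion of Yes responses per condition
c = k * (k - 1) // 2            # number of pairwise comparisons

# Nondirectional comparisons with the familywise error rate held at .05.
z = NormalDist().inv_cdf(1 - 0.05 / (2 * c))
cd = critical_difference(z, n, k, T, R)

# Conditions are reported with 1-based labels, as in the text.
significant = [(a + 1, b + 1) for a, b in combinations(range(k), 2)
               if abs(p[a] - p[b]) >= cd]

print(round(cd, 2), significant)
```

Passing z = 1.96 instead of the adjusted value gives the critical difference for comparisons in which the familywise error rate is not controlled.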

An alternative strategy that can be employed for conducting pairwise comparisons for the 
Cochran Q test is to use the McNemar test for each comparison. In employing the McNemar 
test one can employ either the chi-square or normal approximation of the test statistic for each 
comparison (the continuity-corrected value generally providing a more accurate estimate for a 
small sample size), or compute the exact binomial probability for the comparison. It will be
demonstrated in Section VII that the chi-square value computed with Equation 20.1
for the McNemar test yields the same $Q = \chi^2$ value that is obtained if a Cochran Q test
(Equation 26.1) is employed to compare the same set of experimental conditions. In this section
the exact binomial probabilities for the three pairwise comparisons for the McNemar test model 
will be computed. In order to compute the exact binomial probabilities (or, for that matter, the 
chi-square or normal approximations of the test statistic), the data for each comparison must be 
placed within a 2 x 2 table like Table 20.1 (which is the table for the McNemar test model). To 
illustrate this, the data for the Condition 1 versus Condition 2 comparison are recorded in Table 
26.3. Note that of the n = 12 subjects involved in the comparison, only 6 of the subjects’ scores 
are actually taken into account in computing the test statistic, since the other 6 subjects have the 
same score in both conditions. 

Employing Table A6 (Table of the Binomial Distribution, Individual Probabilities) in
the Appendix to compute the binomial probability, we determine that when n = 6, P(x = 6)
= .0156. Thus, the two-tailed binomial probability for the Condition 1 versus Condition 2
comparison is (2)(.0156) = .0312. In the case of the Condition 1 versus Condition 3 comparison, the
frequencies for Cells a, b, c, and d are, respectively, 1, 2, 2, and 7. Since the frequency of both
Cells b and c is 2 (and thus, n = 4), the Condition 1 versus Condition 3 comparison results in no
difference. In the case of the Condition 2 versus Condition 3 comparison, the frequencies for
Cells a, b, c, and d are, respectively, 2, 1, 7, and 2. For the latter comparison n = 8, since the
frequencies for Cells b and c are 1 and 7. Using Table A6 (or Table A7, which is the Table of
the Binomial Distribution, Cumulative Probabilities), we determine that when n = 8,
$P(x \geq 7) = .0352$. Thus, the two-tailed binomial probability for the Condition 2 versus Condition 3
comparison is (2)(.0352) = .0704. Note that the binomial probabilities computed for the
Condition 1 versus Condition 2 and Condition 2 versus Condition 3 comparisons are not
identical, since by virtue of eliminating subjects who respond in the same category in both
conditions from the analysis, the two comparisons employ different values for n (which is the
sum of the frequencies for Cells b and c).


Table 26.3 McNemar Test Model for Binomial Analysis of Condition 1
Versus Condition 2 Comparison


                                Condition 1            Row sums
                           Yes (1)      No (0)
Condition 2    Yes (1)      a = 3        b = 6             9
               No (0)       c = 0        d = 3             3
Column sums                   3            9              12
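The exact binomial probabilities for the three pairwise comparisons can be computed directly from the raw data. The sketch below is an illustration only: the function name `exact_pairwise` is an assumption, and it counts the two kinds of discordant subjects (the counts that occupy Cells b and c) before summing binomial terms with $\pi = .5$. Because it does not round intermediate values, its two-tailed results can differ in the fourth decimal place from values obtained by doubling a rounded table entry (e.g., .0703 rather than .0704).

```python
from math import comb

# Table 26.1 data: columns are Conditions 1, 2, 3 (1 = Yes, 0 = No).
data = [
    (1, 1, 0), (0, 1, 0), (1, 1, 1), (0, 1, 0),
    (0, 1, 0), (0, 1, 1), (0, 0, 0), (0, 1, 0),
    (1, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 1),
]


def exact_pairwise(data, a, b):
    """Exact binomial analysis of conditions a and b (0-based column indices).

    Returns (m, one_tailed, two_tailed), where m is the number of subjects
    whose scores differ between the two conditions.
    """
    only_a = sum(1 for row in data if row[a] == 1 and row[b] == 0)
    only_b = sum(1 for row in data if row[a] == 0 and row[b] == 1)
    m = only_a + only_b
    larger = max(only_a, only_b)
    # Probability of a split at least as extreme as the one observed.
    one_tailed = sum(comb(m, x) for x in range(larger, m + 1)) / 2 ** m
    return m, one_tailed, min(1.0, 2 * one_tailed)


result_1_vs_2 = exact_pairwise(data, 0, 1)
result_1_vs_3 = exact_pairwise(data, 0, 2)
result_2_vs_3 = exact_pairwise(data, 1, 2)
print(result_1_vs_2, result_1_vs_3, result_2_vs_3)
```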

As before, if we wish to evaluate a nondirectional alternative hypothesis and insure that
$\alpha_{FW}$ does not exceed .05, the value of $\alpha_{PC}$ is set equal to .0167. Thus, in order to reject the
null hypothesis the computed two-tailed binomial probability for a comparison must be equal to
or less than .0167. Since the computed two-tailed probabilities .0312 (for the Condition 1 versus
Condition 2 comparison) and .0704 (for the Condition 2 versus Condition 3 comparison) are
greater than .0167, none of the pairwise comparisons is significant.

In the event the researcher elects not to control the value of $\alpha_{FW}$, and employs $\alpha_{PC} = .05$
in evaluating the three pairwise comparisons (once again assuming a nondirectional analysis),
only the Condition 1 versus Condition 2 comparison is significant, since the computed two-tailed
binomial probability .0312 is less than .05. The Condition 2 versus Condition 3 comparison falls
short of significance, since the computed two-tailed binomial probability .0704 is greater than
.05. It should be noted that both of the aforementioned comparisons are significant if the
directional alternative hypothesis that is consistent with the data is employed (since the one-tailed
probabilities .0156 and .0352 are less than .05). If Equation 26.3 is employed for the same set
of comparisons, $CD_C = (1.96)(.204) = .40$.⁸ Thus, employing the latter equation, the
Condition 1 versus Condition 2 and Condition 2 versus Condition 3 comparisons are significant,
since in both instances the difference between the two proportions is greater than $CD_C = .40$.

Although the binomial probabilities for the McNemar test for the Condition 1 versus
Condition 2 and Condition 2 versus Condition 3 comparisons are larger than the probabilities
associated with the use of Equation 26.3, both comparison procedures yield relatively low
probability values for the two aforementioned comparisons. Thus, in the case of Example 26.1,
depending upon which comparison procedure one employs (as well as the value of $\alpha_{PC}$ and
whether one evaluates a nondirectional or directional alternative hypothesis) it would appear that
there is a high likelihood that the Condition 1 versus Condition 2 and Condition 2 versus
Condition 3 comparisons are significant. The intent of presenting two different comparison
procedures in this section is to illustrate that, generally speaking, the results obtained with different
procedures will be reasonably consistent with one another.⁹ As is noted in the discussion of
comparisons in Section VI of the single-factor between-subjects analysis of variance, in
instances where two or more comparison procedures yield inconsistent results, the most effective
way to clarify the status of the null hypothesis is to replicate a study one or more times. It is also
noted throughout the book that, in the final analysis, the decision regarding which of the
available comparison procedures to employ is usually not the most important issue facing the
researcher conducting comparisons. The main issue is what maximum value one is willing to
tolerate for $\alpha_{FW}$.

Marascuilo and McSweeney (1977) discuss the computation of a confidence interval for
a comparison for the Cochran Q test model. One approach for computing a confidence interval
is to add to and subtract the computed value of $CD_C$ from the obtained difference between the
proportions involved in the comparison. The latter approach is based on the same logic
employed for computing a confidence interval for a comparison in Section VI of the
single-factor between-subjects analysis of variance.
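To make the add-and-subtract approach concrete, the sketch below applies it to the Condition 2 versus Condition 1 comparison of Example 26.1. It is an illustration under stated assumptions: the Bonferroni-Dunn adjusted z value for three nondirectional comparisons is used, so the resulting limits are tied to a familywise error rate of .05, and the variable names are hypothetical.

```python
import math
from statistics import NormalDist

n, k, T, R = 12, 3, 15, 27                  # summary values for Example 26.1
z = NormalDist().inv_cdf(1 - 0.05 / 6)      # alpha_FW / 2c with c = 3
cd = z * math.sqrt(2 * (k * T - R) / (n ** 2 * k * (k - 1)))   # Equation 26.3

diff = 0.75 - 0.25                          # p_2 - p_1 from Table 26.1
interval = (diff - cd, diff + cd)           # difference minus/plus CD_C

print(round(interval[0], 2), round(interval[1], 2))
```

An interval built this way excludes zero exactly when the observed difference exceeds the critical difference, which is why it leads to the same decision as the comparison itself.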


VII. Additional Discussion of the Cochran Q Test 


1. Issues relating to subjects who obtain the same score under all of the experimental 
conditions 

a) Cochran (1950) noted that since the value computed for Q is not affected by the scores
of any subject (or any row, if matched subjects are employed) who obtains either all 0s or all 1s
in each of the experimental conditions, the scores of such subjects can be deleted from the data
analysis. If the latter is done with respect to Example 26.1, the data for Subjects 3, 7, and 11 can
be eliminated from the analysis (since Subjects 7 and 11 obtain all 0s, and Subject 3 obtains all
1s). It is demonstrated below that if the scores of Subjects 3, 7, and 11 are eliminated from the
analysis, the value Q = 8 is still obtained when the revised summary values are substituted in
Equation 26.1.


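$\sum C_1 = 2$   $\sum C_2 = 8$   $\sum C_3 = 2$   $(\sum C_1)^2 = 4$   $(\sum C_2)^2 = 64$   $(\sum C_3)^2 = 4$
$\sum(\sum C_j)^2 = 4 + 64 + 4 = 72$   $\sum R_i = 12$   $\sum R_i^2 = 18$
Thus: C = 72   T = 12   R = 18


$$Q = \frac{(3 - 1)[(3)(72) - (12)^2]}{(3)(12) - 18} = 8$$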





It is noted in Section VI of the McNemar test that the latter test essentially eliminates from 
the analysis any subject who obtains the same score under both experimental conditions, and that 
this represents a limitation of the test. What was said with regard to the McNemar test in this 
respect also applies to the Cochran Q test. Thus, it is entirely possible to obtain a significant 
Q value even if the overwhelming majority of the subjects in a sample obtain the same score in 
each of the experimental conditions. To illustrate this, the value Q = 8 (obtained for Example 
26.1) can be obtained for a sample of 1009 subjects, if 1000 of the subjects obtained a score of 
1 in all three experimental conditions, and the remaining nine subjects had the same scores as 
Subjects 1, 2, 4, 5, 6, 8, 9, 10, and 12 in Example 26.1. Since the computation of the Q value in 
such an instance will be based on a sample size of 9 rather than on the actual sample size of 1009, 
it is reasonable to assume that such a result, although statistically significant, will not be of any 
practical significance from the perspective of the three automobile manufacturers. The latter 
statement is based on the fact that since all but 9 of the 1009 subjects said they would buy all
three automobiles, there really do not appear to be any differences in preference that will be
of any economic consequence to the manufacturers.
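Both points made above (that deleting subjects with uniform scores leaves Q unchanged, and that adding 1000 subjects who score 1 in every condition also leaves it unchanged) can be verified numerically. The following sketch restates Equation 26.1 as a hypothetical `cochran_q` function so that the block is self-contained.

```python
def cochran_q(data):
    """Cochran Q via Equation 26.1; rows are subjects, columns are conditions."""
    k = len(data[0])
    col_sums = [sum(row[j] for row in data) for j in range(k)]
    row_sums = [sum(row) for row in data]
    C = sum(c ** 2 for c in col_sums)
    T = sum(row_sums)
    R = sum(r ** 2 for r in row_sums)
    return (k - 1) * (k * C - T ** 2) / (k * T - R)


full = [
    (1, 1, 0), (0, 1, 0), (1, 1, 1), (0, 1, 0),
    (0, 1, 0), (0, 1, 1), (0, 0, 0), (0, 1, 0),
    (1, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 1),
]

# Drop subjects who obtain all 0s or all 1s (Subjects 3, 7, and 11).
informative = [row for row in full if 0 < sum(row) < len(row)]

# Nine informative subjects plus 1000 subjects who say Yes to all three cars.
padded = informative + [(1, 1, 1)] * 1000

q_full = cochran_q(full)
q_informative = cochran_q(informative)
q_padded = cochran_q(padded)
print(len(informative), len(padded), q_full, q_informative, q_padded)
```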

b) In Section I it is noted that when n < 4 and/or nk < 24, it is recommended that tables for
the exact Cochran test statistic (derived by Patil (1975)) be employed instead of the chi-square
approximation. In making such a determination, the value of n that is used should not
include any subjects who obtain all 0s or all 1s in each of the experimental conditions. Thus, in
Example 26.1, the value n = 9 is employed, and not the value n = 12. Consequently nk = (9)(3)
= 27. Since n > 4 and nk > 24, it is acceptable to employ the chi-square approximation.
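The sample-size guideline can be expressed as a small check. This is a sketch, assuming the guideline is read as "the chi-square approximation is acceptable when the effective n is at least 4 and nk is at least 24"; the function name `chi_square_approximation_ok` is hypothetical.

```python
def chi_square_approximation_ok(data):
    """Return (effective n, effective n times k, whether the approximation is acceptable).

    Subjects who obtain all 0s or all 1s are excluded from the effective n.
    """
    k = len(data[0])
    n_effective = sum(1 for row in data if 0 < sum(row) < k)
    nk = n_effective * k
    return n_effective, nk, (n_effective >= 4 and nk >= 24)


example_26_1 = [
    (1, 1, 0), (0, 1, 0), (1, 1, 1), (0, 1, 0),
    (0, 1, 0), (0, 1, 1), (0, 0, 0), (0, 1, 0),
    (1, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 1),
]
check = chi_square_approximation_ok(example_26_1)
print(check)
```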

c) Note that Equation 26.3 (the equation employed for conducting comparisons) employs
the value of n for the total number of subjects, irrespective of whether a subject obtains the same
score in all k experimental conditions. The use of the latter n value maximizes the power of the
comparison procedure. Certainly one could make an argument for employing as n in Equation
26.3 the number of subjects who have at least two different scores in the k experimental
conditions. In most instances, the latter n value will be less than the total number of subjects
employed in an experiment, and, consequently, if the smaller n value is employed in Equation
26.3, the comparison procedure will be more conservative (since it will result in a higher value
for $CD_C$).
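The effect of the choice of n on Equation 26.3 can be illustrated with the values from Example 26.1. In the sketch below the quantity kT - R equals 18 whether or not Subjects 3, 7, and 11 are retained, so only n changes; the function name and the decision to reuse the adjusted z value for three nondirectional comparisons are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def critical_difference(z, n, kT_minus_R, k):
    """Equation 26.3 written in terms of the quantity kT - R."""
    return z * math.sqrt(2 * kT_minus_R / (n ** 2 * k * (k - 1)))


k = 3
z = NormalDist().inv_cdf(1 - 0.05 / 6)   # familywise .05, three nondirectional comparisons

cd_all_subjects = critical_difference(z, 12, 18, k)   # n = total number of subjects
cd_informative = critical_difference(z, 9, 18, k)     # n = subjects with differing scores

print(round(cd_all_subjects, 2), round(cd_informative, 2))
```

With the smaller n the critical difference rises from about .49 to about .65, which shows numerically why the procedure becomes more conservative.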


2. Equivalency of the Cochran Q test and the McNemar test when k = 2 In Section I it is
noted that when k = 2 the Cochran Q test yields a result that is equivalent to that obtained with
the McNemar test. To be more specific, the Cochran Q test will yield a result that is equivalent
to the McNemar test statistic when the correction for continuity is not employed for the latter
test (i.e., the result obtained with Equation 20.1). In order to demonstrate the equivalency of the
two tests, Example 26.2 will be evaluated with both tests.


Example 26.2 A market researcher asks 10 female subjects whether or not they would purchase 
an automobile manufactured by two different companies. Specifically, subjects are asked 
whether they would purchase an automobile manufactured by Chenesco and Howasaki. Except 
for Subjects 2 and 3, all of the subjects said they would purchase a Chenesco but would not 
purchase a Howasaki. Subject 2 said she would not purchase either car, while Subject 3 said 
she would purchase a Howasaki but not a Chenesco. Based on the responses of subjects, can 
the market researcher conclude that there are differences with respect to car preference? 


Tables 26.4 and 26.5 respectively summarize the data for the study within the framework 
of the Cochran Q test model and the McNemar test model. 


Table 26.4 Summary of Data for Analysis of Example 26.2 with Cochran Q Test


              Chenesco   Howasaki
                C_1        C_2       R_i    R_i^2
Subject 1        1          0         1       1
Subject 2        0          0         0       0
Subject 3        0          1         1       1
Subject 4        1          0         1       1
Subject 5        1          0         1       1
Subject 6        1          0         1       1
Subject 7        1          0         1       1
Subject 8        1          0         1       1
Subject 9        1          0         1       1
Subject 10       1          0         1       1

$\sum C_1 = 8$   $\sum C_2 = 1$   $\sum R_i = 9$   $\sum R_i^2 = 9$

Table 26.5 Summary of Data for Analysis of Example 26.2 with McNemar Test


                                  Condition 1           Row sums
                                  (Chenesco)
                             No (0)      Yes (1)
Condition 2     No (0)        a = 1       b = 8             9
(Howasaki)      Yes (1)       c = 1       d = 0             1
Column sums                     2           8              10

Example 26.2 is evaluated below employing both the Cochran Q test and the McNemar
test. Note the computed value $Q = \chi^2 = 5.44$ is equivalent to $\chi^2 = 5.44$ computed for the
McNemar test.


Cochran Q test:


$\sum C_1 = 8$   $\sum C_2 = 1$   $(\sum C_1)^2 = 64$   $(\sum C_2)^2 = 1$
$\sum(\sum C_j)^2 = 64 + 1 = 65$   $\sum R_i = 9$   $\sum R_i^2 = 9$
Thus: C = 65   T = 9   R = 9


$$Q = \frac{(2 - 1)[(2)(65) - (9)^2]}{(2)(9) - 9} = 5.44$$

McNemar test:

$$\chi^2 = \frac{(b - c)^2}{b + c} = \frac{(8 - 1)^2}{8 + 1} = 5.44$$

In the case of both tests, df = 1 (since in the case of the Cochran Q test df = k - 1 = 2 - 1
= 1, and in the case of the McNemar test the number of degrees of freedom is
always df = 1). The tabled critical .05 and .01 chi-square values for df = 1 are $\chi^2_{.05} = 3.84$ and
$\chi^2_{.01} = 6.63$. Since the obtained value $\chi^2 = 5.44$ is greater than $\chi^2_{.05} = 3.84$, the nondirectional
alternative hypothesis is supported at the .05 level. Since $\chi^2 = 5.44$ is less than $\chi^2_{.01} = 6.63$,
it is not supported at the .01 level.
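The equivalence of the two statistics for Example 26.2 can be confirmed numerically. In the sketch below, `cochran_q` and `mcnemar_chi_square` are hypothetical helper names; the latter implements the uncorrected statistic $(b - c)^2/(b + c)$ shown above, and the probability for df = 1 is obtained from the complementary error function, which is an exact identity for a chi-square variable with one degree of freedom.

```python
import math


def cochran_q(data):
    """Cochran Q via Equation 26.1; rows are subjects, columns are conditions."""
    k = len(data[0])
    col_sums = [sum(row[j] for row in data) for j in range(k)]
    row_sums = [sum(row) for row in data]
    C = sum(c ** 2 for c in col_sums)
    T = sum(row_sums)
    R = sum(r ** 2 for r in row_sums)
    return (k - 1) * (k * C - T ** 2) / (k * T - R)


def mcnemar_chi_square(b, c):
    """Uncorrected McNemar statistic for discordant cell frequencies b and c."""
    return (b - c) ** 2 / (b + c)


# Example 26.2: columns are (Chenesco, Howasaki). Subject 2 declines both cars,
# Subject 3 prefers only the Howasaki, and the other eight only the Chenesco.
data = [(1, 0), (0, 0), (0, 1)] + [(1, 0)] * 7

b = sum(1 for c1, c2 in data if c1 == 1 and c2 == 0)   # Chenesco only
c = sum(1 for c1, c2 in data if c1 == 0 and c2 == 1)   # Howasaki only

q = cochran_q(data)
chi_square = mcnemar_chi_square(b, c)
p_value = math.erfc(math.sqrt(q / 2))   # upper-tail probability for df = 1

print(round(q, 2), round(chi_square, 2), round(p_value, 4))
```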

In point of fact, the data in Table 26.4 are based on Example 19.1, which is employed to
illustrate the binomial sign test for two dependent samples (Test 19). If we assume that in the
case of Example 19.1 a subject is assigned a score of 1 in the condition in which she has a higher
score and a score of 0 in the condition in which she has a lower score, and that a subject
is assigned a score of 0 (or 1) in both conditions if she has the same score, the data in Table 19.1
will be identical to that presented in Table 26.4.¹⁰ When Equation 19.3 (the uncorrected (for
continuity) normal approximation for the binomial sign test for two dependent samples) is
employed to evaluate Example 19.1, it yields the value z = 2.33 which, if squared, equals the
obtained chi-square value for Example 26.2 (i.e., $(z = 2.33)^2 = (\chi^2 = 5.44)$).¹¹ Thus, when
k = 2 the McNemar test/Cochran Q test are equivalent to the binomial sign test for two
dependent samples. It should also be noted that the exact binomial probability computed for
the binomial sign test for two dependent samples will be equivalent to the exact binomial
probability computed when the McNemar test/Cochran Q test is employed to evaluate the same
data. For Examples 19.1/26.2, the two-tailed binomial probability is $(2)P(x \geq 8) = (2)(.0196)$
= .0392 (for n = 9).
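The relationship between the sign test and the chi-square value can also be checked numerically. The sketch below assumes that the uncorrected normal approximation takes the usual form $z = (x - n\pi)/\sqrt{n\pi(1 - \pi)}$ with $\pi = .5$, x = 8, and n = 9 (the nine subjects whose two scores differ); the variable names are illustrative. Because no intermediate rounding occurs, the exact two-tailed probability prints as .0391 rather than the .0392 obtained by doubling the rounded table value .0196.

```python
import math
from math import comb

x, n, pi = 8, 9, 0.5   # 8 of the 9 subjects with differing scores favor Chenesco

# Uncorrected normal approximation for the binomial sign test.
z = (x - n * pi) / math.sqrt(n * pi * (1 - pi))

# Exact binomial probability of 8 or more out of 9, then doubled.
one_tailed = sum(comb(n, j) for j in range(x, n + 1)) / 2 ** n
two_tailed = 2 * one_tailed

print(round(z, 2), round(z ** 2, 2), round(two_tailed, 4))
```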


3. Alternative nonparametric procedures for categorical data for evaluating a design 
involving k dependent samples Daniel (1990) and/or Fleiss (1981) note that alternative
procedures for comparing k or more matched samples have been developed by Bennett (1967, 1968)
and Shah and Claypool (1985). Chou (1989) describes a median test that can be employed to 
evaluate more than two dependent samples. The latter test, which employs the chi-square dis- 
tribution to approximate the exact sampling distribution, employs subject/block medians as 
reference points in determining whether two or more of the treatment conditions represent 
different populations. The test described by Chou (1989) assumes that subjects’ original scores 
are in an interval/ratio format and are converted into categorical data. 
VIII. Additional Examples Illustrating the Use of the Cochran Q Test


Since the Cochran Q test can be employed to evaluate any dependent samples design involving 
two or more experimental conditions, it can also be used to evaluate any of the examples 
discussed under the McNemar test. Examples 26.3—26.7 are additional examples that can be 
evaluated with the Cochran Q test. Examples 26.6 and 26.7 represent extensions of a before- 
after design to a design involving k = 3 experimental conditions. Since the data for all of the 
examples in this section (with the exception of Example 26.7) are identical to the data employed 
in Example 26.1, they yield the same result. 


Example 26.3 A researcher wants to assess the relative likelihood of three brands of house 
paint fading within two years of application. In order to make this assessment he applies the 
following three brands of house paint that are identical in hue to a sample of houses that 
have cedar shingles: Brightglow, Colorfast, and Prismalong. In selecting the houses the 
researcher identifies 12 neighborhoods which vary with respect to geographical conditions, 
and within each neighborhood he randomly selects 3 houses. Within each block of three houses, 
one of the houses is painted with Brightglow, a second house with Colorfast, and a third house 
with Prismalong. Thus, a total of 36 houses are painted in the study. Two years after the houses 
are painted, an independent judge categorizes each house with respect to whether or not the 
paint on its shingles has faded. A house is assigned the number 1 if there is evidence of fading 
and the number 0 if there is no evidence of fading. Table 26.6 summarizes the results of the 
study. Do the data indicate differences between the three brands of house paint with respect to 
fading? 


Table 26.6 Data for Example 26.3 


Brand of paint 
Brightglow Colorfast Prismalong 
Block 1 1 1 0 
Block 2 0 1 0 
Block 3 1 1 1 
Block 4 0 1 0 
Block 5 0 1 0 
Block 6 0 1 1 
Block 7 0 0 0 
Block 8 0 1 0 
Block 9 1 1 0 
Block 10 0 1 0 
Block 11 0 0 0 
Block 12 0 0 1 


Note that in Example 26.3 the 12 blocks, comprised of 3 houses per block, are analogous 
to the use of 12 sets of matched subjects with 3 subjects per set/block. The brands of house paint 
represent the three levels of the independent variable, and the judge's categorization for each 
house with respect to fading (i.e., 1 versus 0) represents the dependent variable. Based on the 
analysis conducted for Example 26.1, there is a strong suggestion that Colorfast paint is 
perceived as more likely to fade than the other two brands. 
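
The analysis referred to above can be reproduced directly from Table 26.6. The following is a minimal sketch: the function name `cochran_q` is illustrative, and the form Q = (k - 1)[kC - T²]/(kT - R) is assumed from the worked computations shown for the Cochran Q test in this chapter, where C is the sum of the squared column totals, T the sum of the row totals, and R the sum of the squared row totals.

```python
def cochran_q(rows):
    """Return (Q, df) for 0/1 data: one row per block, one column per condition."""
    k = len(rows[0])
    column_totals = [sum(col) for col in zip(*rows)]
    row_totals = [sum(row) for row in rows]
    C = sum(c ** 2 for c in column_totals)   # sum of squared column totals
    T = sum(row_totals)                      # total number of 1 responses
    R = sum(r ** 2 for r in row_totals)      # sum of squared row totals
    q = (k - 1) * (k * C - T ** 2) / (k * T - R)
    return q, k - 1

# Table 26.6: (Brightglow, Colorfast, Prismalong) for Blocks 1-12.
table_26_6 = [
    (1, 1, 0), (0, 1, 0), (1, 1, 1), (0, 1, 0), (0, 1, 0), (0, 1, 1),
    (0, 0, 0), (0, 1, 0), (1, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 1),
]

q, df = cochran_q(table_26_6)
print(f"Q = {q:.2f}, df = {df}")
```

Because Tables 26.7, 26.8, and 26.9 contain the same pattern of scores, passing any of them to the function yields the same value of Q.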


Example 26.4 Twelve male marines are administered a test of physical fitness which requires
that an individual achieve the minimum criterion noted for the following three tasks: a) Climb
a 100 ft. rope; b) Do 25 chin-ups; and c) Run a mile in under six minutes. Within the sample of
12 subjects, the order of presentation of the three tasks is completely counterbalanced (i.e., each
of the six possible presentation orders for the tasks is presented to two subjects). For each of 
the tasks a subject is assigned a score of 1 if he achieves the minimum criterion and a score of 
0 if he does not. Table 26.7 summarizes the results of the testing. Do the data indicate there is 
a difference between the three tasks with respect to subjects achieving the criterion? 


Table 26.7 Data for Example 26.4 


Task 
Rope climb Chin-ups Mile run 
Subject 1 1 1 0 
Subject 2 0 1 0 
Subject 3 1 1 1 
Subject 4 0 1 0 
Subject 5 0 1 0 
Subject 6 0 1 1 
Subject 7 0 0 0 
Subject 8 0 1 0 
Subject 9 1 1 0 
Subject 10 0 1 0 
Subject 11 0 0 0 
Subject 12 0 0 1 


Based on the analysis conducted for Example 26.1, the data suggest that subjects are more 
likely to achieve the criterion for chin-ups than they are for the other two tasks. 


Example 26.5 A horticulturist working at a university is hired to evaluate the effectiveness of 
three different kinds of weed killer (Zapon, Snuffout, and Shalom). Twelve athletic fields of equal 
size are selected as test sites. The researcher divides each athletic field into three equally sized 
areas, and within each field (based on random determination) he applies one kind of weed killer 
to one third of the field, a second kind of weed killer to another third of the field, and the third 
kind of weed killer to the remaining third of the field. This procedure is employed for all 12 
athletic fields, resulting in 36 separate areas to which weed killer is applied. Six months after 
application of the weed killer, an independent judge evaluates the 36 areas with respect to weed 
growth. The judge employs the number 1 to indicate that an area has evidence of weed growth 
and the number 0 to indicate that an area does not have evidence of weed growth. Table 26.8 
summarizes the judge’s categorizations. Do the data indicate there is a difference in effectiveness
between the three kinds of weed killer?


Note that in Example 26.5 the 12 athletic fields are analogous to 12 subjects who are 
evaluated under three experimental conditions. The three brands of weed killer represent the 
levels of the independent variable, and the judge’s categorization of each area with respect to 
weed growth (i.e., 1 versus 0) represents the dependent variable. Based on the analysis 
conducted for Example 26.1, the data suggest that Snuffout is less effective than the other two 
brands of weed killer. 


Example 26.6 A social scientist conducts a study assessing the impact of a federal gun control 
law on rioting in large cities. Assume that as a result of legislative changes the law in question, 
which severely limits the public's access to firearms, was not in effect between the years
1985-1989, but was in effect during the five years directly preceding and following that time 
period (i.e., the gun control law was in effect during the periods 1980-1984 and 1990-1994).
In conducting the study, the social scientist categorizes 12 large cities with respect to whether 
or not there was a major riot within each of the three designated time periods. Thus, each city 
is categorized with respect to whether or not a riot occurred during: a) 1980-1984, during
which time the gun control law was in effect (Time 1); b) 1985-1989, during which time the gun
control law was not in effect (Time 2); and c) 1990-1994, during which time the gun control law
was in effect (Time 3). A code of 1 is employed to indicate the occurrence of at least one major 
riot during a specified five-year time period, and a code of 0 is employed to indicate the absence 
of a major riot during a specified time period. Table 26.9 summarizes the results of the study. 
Do the data indicate the gun control law had an effect on rioting? 


Table 26.8 Data for Example 26.5 


Weed killer 
Zapon Snuffout Shalom 
Field 1 1 1 0 
Field 2 0 1 0 
Field 3 1 1 1 
Field 4 0 1 0 
Field 5 0 1 0 
Field 6 0 1 1 
Field 7 0 0 0 
Field 8 0 1 0 
Field 9 1 1 0 
Field 10 0 1 0 
Field 11 0 0 0 
Field 12 0 0 1 


Table 26.9 Data for Example 26.6


Time period
Time 1 Time 2 Time 3
(1980-1984) (1985-1989) (1990-1994)
New York 1 1 0 
Chicago 0 1 0 
Detroit 1 1 1 
Philadelphia 0 1 0 
Los Angeles 0 1 0 
Dallas 0 1 1 
Houston 0 0 0 
Miami 0 1 0 
Washington 1 1 0 
Boston 0 1 0 
Baltimore 0 0 0 
Atlanta 0 0 1 


Note that in Example 26.6 the 12 cities are analogous to 12 subjects who are evaluated 
during three time periods. Example 26.6 can be conceptualized as representing what is referred 
to as a time series design. A time series design is essentially a before-after design in which 
one or more blocks are evaluated one or more times both prior to and following an experimental 
treatment. In Example 26.6 each of the cities represents a block. Time series designs are most 
commonly employed in the social sciences when a researcher wants to evaluate social change 
through analysis of archival data (i.e., public records). The internal validity of a time series 
design is limited insofar as the treatment is not manipulated by the researcher, and thus any
observed differences across time periods with respect to the dependent variable may be due to
extraneous variables over which the researcher has no control. Thus, although when Example 
26.6 is evaluated with the Cochran Q test the obtained Q value is significant (suggesting that 
more riots occurred when the gun control law was not in effect), other circumstances (such as 
the economy, race relations, etc.) may have varied across time periods, and such factors (as 
opposed to the gun control law) may have been responsible for the observed effect.¹² Another
limitation of the time series design described by Example 26.6 is that since the membership of 
the blocks is not the result of random assignment, the various blocks will not be directly 
comparable to one another. 

In closing the discussion of Example 26.6, it should be noted that in practice the Cochran 
Q test is not commonly employed to evaluate a time series design. Additionally, it is worth 
noting that, in the final analysis, it is probably more prudent to conceptualize Example 26.6 as 
a mixed factorial design, viewing each of the cities as a separate level of a second independent 
variable. In a mixed factorial design involving two independent variables, one independent var- 
iable is a between-subjects variable (i.e., each subject/block is evaluated under only one level of 
that independent variable), while the other independent variable is a within-subjects variable (i.e.,
each subject/block is evaluated under all levels of that independent variable). If Example 26.6 
is conceptualized as a mixed factorial design, the different cities represent the between-subjects 
variable and the three time periods represent the within-subjects variable. Such a design is 
typically evaluated with the factorial analysis of variance for a mixed design (Test 27i) which 
is discussed in Section IX (the Addendum) of the between-subjects factorial analysis of vari- 
ance (Test 27). It should be noted, however, that in employing the latter analysis of variance, 
the dependent variable is generally represented by interval/ratio level data.¹³


Example 26.7 In order to assess the efficacy of a drug which a pharmaceutical company claims 
is effective in treating hyperactivity, 12 hyperactive children are evaluated during the following 
three time periods: a) One week prior to taking the drug; b) After a child has taken the drug for 
six consecutive months; and c) Six months after the drug is discontinued. The children are 
observed by judges who employ a standardized procedure for evaluating hyperactivity. The 
procedure requires that during each time period a child be assigned a score of 1 if he is 
hyperactive and a score of 0 if he is not hyperactive. During the evaluation process, the judges 
are blind with respect to whether a child is taking medication at the time he or she is evaluated. 
Table 26.10 summarizes the results of the study. Do the data indicate the drug is effective? 


Example 26.7 employs the same experimental design to evaluate the hypothesis that is 
evaluated in Example 24.7 (in Section VIII of the single-factor within-subjects analysis of 
variance (Test 24)). In Example 26.7, however, categorical data are employed to represent the 
dependent variable. Evaluation of the data with the Cochran Q test yields the value Q = 14.89. 


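$$\Sigma C_1 = 12 \quad \Sigma C_2 = 3 \quad \Sigma C_3 = 10 \quad (\Sigma C_1)^2 = 144 \quad (\Sigma C_2)^2 = 9 \quad (\Sigma C_3)^2 = 100$$

$$\Sigma(\Sigma C_j)^2 = 144 + 9 + 100 = 253 \quad \Sigma R_i = 25 \quad \Sigma R_i^2 = 57$$


Thus: $C = 253 \quad T = 25 \quad R = 57$


$$Q = \frac{(3 - 1)[(3)(253) - (25)^2]}{(3)(25) - 57} = 14.89$$

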
Table 26.10 Data for Example 26.7


Time Period 


Time 1 Time 2 Time 3 
Child 1 1 0 1 
Child 2 1 0 1 
Child 3 1 0 1 
Child 4 1 0 0 
Child 5 1 0 0 
Child 6 1 0 1 
Child 7 1 1 1 
Child 8 1 0 1 
Child 9 1 0 1 
Child 10 1 0 1 
Child 11 1 1 1 
Child 12 1 1 1 


Since k = 3, df = 2. The tabled critical .05 and .01 values for df = 2 are $\chi^2_{.05} = 5.99$ and
$\chi^2_{.01} = 9.21$. Since the obtained value $Q = \chi^2 = 14.89$ is greater than both of the aforementioned
critical values, the alternative hypothesis is supported at both the .05 and .01 levels.
Inspection of Table 26.10 strongly suggests that the significant effect is due to the lower 
frequency of hyperactivity during the time subjects are taking the drug (Time period 2). The 
latter, of course, can be confirmed by conducting comparisons between pairs of time periods. 
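
The summary values and the decision for Example 26.7 can be verified with a short computation. This is a sketch only: the data are transcribed from Table 26.10, the critical values are the tabled values cited above, and the variable names are illustrative.

```python
# Table 26.10: (Time 1, Time 2, Time 3) for Child 1 through Child 12.
table_26_10 = [
    (1, 0, 1), (1, 0, 1), (1, 0, 1), (1, 0, 0), (1, 0, 0), (1, 0, 1),
    (1, 1, 1), (1, 0, 1), (1, 0, 1), (1, 0, 1), (1, 1, 1), (1, 1, 1),
]

k = len(table_26_10[0])
column_totals = [sum(col) for col in zip(*table_26_10)]
row_totals = [sum(row) for row in table_26_10]

C = sum(c ** 2 for c in column_totals)   # sum of squared column totals
T = sum(row_totals)                      # sum of row totals
R = sum(r ** 2 for r in row_totals)      # sum of squared row totals
Q = (k - 1) * (k * C - T ** 2) / (k * T - R)

# Tabled critical chi-square values for df = 2, as cited in the text.
critical_values = {".05": 5.99, ".01": 9.21}

print(f"C = {C}, T = {T}, R = {R}, Q = {Q:.2f}")
for level, value in critical_values.items():
    print(f"significant at {level}: {Q >= value}")
```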

As noted in the discussion of Example 24.7, the design of the latter study does not 
adequately control for the effects of extraneous/confounding variables. The same comments 
noted in the aforementioned discussion also apply to Example 26.7. Example 26.7 can also be 
conceptualized within the context of a time-series design. 
Endnotes


1. A more detailed discussion of the guidelines noted below can be found in Sections I and 
VII of the t test for two dependent samples.


2. Although it is possible to conduct a directional analysis, such an analysis will not be 
described with respect to the Cochran Q test. A discussion of a directional analysis when 
k = 2 can be found under the McNemar test. A discussion of the evaluation of a 
directional alternative hypothesis when k ≥ 3 can be found in Section VII of the chi-square
goodness-of-fit test (Test 8). Although the latter discussion is in reference to analysis of
a k independent samples design involving categorical data, the general principles regarding
analysis of a directional alternative hypothesis when k ≥ 3 are applicable to the Cochran
Q test. 


3. The use of Equation 26.1 to compute the Cochran Q test statistic assumes that the columns 
in the summary table (i.e., Table 26.1) are employed to represent the k levels of the inde- 
pendent variable, and that the rows are employed to represent the n subjects/matched sets 
of subjects. If the columns and rows are reversed (i.e., the columns are employed to 
represent the subjects/matched sets of subjects, and the rows the levels of the independent 
variable), Equation 26.1 cannot be employed to compute the value of Q. 


4. The same Q value is obtained if the frequencies of No responses (0) are employed in 
computing the summary values used in Equation 26.1 instead of the frequencies of Yes (1) 
responses. To illustrate this, the data for Example 26.1 are evaluated employing the fre- 
quencies of No (0) responses. 


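$$\Sigma C_1 = 9 \quad \Sigma C_2 = 3 \quad \Sigma C_3 = 9 \quad (\Sigma C_1)^2 = 81 \quad (\Sigma C_2)^2 = 9 \quad (\Sigma C_3)^2 = 81$$

$$\Sigma(\Sigma C_j)^2 = 81 + 9 + 81 = 171 \quad \Sigma R_i = 21 \quad \Sigma R_i^2 = 45$$

Thus: $C = 171 \quad T = 21 \quad R = 45$


$$Q = \frac{(3 - 1)[(3)(171) - (21)^2]}{(3)(21) - 45} = 8$$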
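
The invariance described in this endnote can be confirmed with a brief computation. The sketch below assumes the scores of Example 26.1 are those listed in Table 26.6 (the text states the data sets are identical); the function name `summary_values` is illustrative.

```python
def summary_values(rows):
    """Return (C, T, R, Q) for 0/1 data: rows are blocks, columns are conditions."""
    k = len(rows[0])
    C = sum(sum(col) ** 2 for col in zip(*rows))
    T = sum(sum(row) for row in rows)
    R = sum(sum(row) ** 2 for row in rows)
    Q = (k - 1) * (k * C - T ** 2) / (k * T - R)
    return C, T, R, Q

# Scores coded so that 1 = Yes (data as listed in Table 26.6).
yes_coded = [
    (1, 1, 0), (0, 1, 0), (1, 1, 1), (0, 1, 0), (0, 1, 0), (0, 1, 1),
    (0, 0, 0), (0, 1, 0), (1, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 1),
]
# Recode so that the No responses are the ones being counted.
no_coded = [tuple(1 - x for x in row) for row in yes_coded]

print(summary_values(yes_coded))
print(summary_values(no_coded))
```

The summary values C, T, and R differ under the two codings, but the numerator and denominator of the equation change together, so the value of Q is unchanged.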


5. In the discussion of comparisons in reference to the analysis of variance, it is noted that a
simple (also known as a pairwise) comparison is a comparison between any two groups/ 
conditions in a set of k groups/conditions. 


6. The method for deriving the value of $z_{adj}$ for the Cochran Q test is based on the same logic
that is employed in Equation 22.5 (which is used for conducting comparisons for the
Kruskal-Wallis one-way analysis of variance by ranks (Test 22)). A rationale for the
use of the proportions .0167 and .0083 in determining the appropriate value for $z_{adj}$ in
Example 26.1 can be found in Endnote 5 of the Kruskal-Wallis one-way analysis of
variance by ranks.
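
The two proportions cited in this endnote are consistent with a Bonferroni-type adjustment. The following sketch rests on two assumptions that are not stated in the endnote itself: a familywise alpha of .05 and three pairwise comparisons among the k = 3 conditions. The variable names are illustrative.

```python
from statistics import NormalDist

alpha_fw = 0.05      # assumed familywise Type I error rate
comparisons = 3      # assumed: all pairwise comparisons among k = 3 conditions

alpha_pc = alpha_fw / comparisons   # per-comparison alpha (.0167)
tail_area = alpha_pc / 2            # area in each tail for a two-tailed test (.0083)

# z value that cuts off tail_area in the upper tail of the normal distribution.
z_adj = NormalDist().inv_cdf(1 - tail_area)

print(f"alpha per comparison = {alpha_pc:.4f}")
print(f"area per tail = {tail_area:.4f}")
print(f"adjusted z = {z_adj:.2f}")
```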


7. It should be noted that when a directional alternative hypothesis is employed, the sign of 
the difference between the two proportions must be consistent with the prediction stated in 
the directional alternative hypothesis. When a nondirectional alternative hypothesis is em- 
ployed, the direction of the difference between the two proportions is irrelevant. 


8. In Equation 26.3 the value $z_{.05} = 1.96$ is employed for $z_{adj}$, and the latter value is multiplied
by .204, which is the value computed for the term in the radical of the equation for
Example 26.1.
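
The arithmetic in this endnote can be sketched as follows. The form of the radical used below is an assumption, chosen because it reproduces the value .204 cited above; T = 15 and R = 27 are the summary values obtained from the Yes (1) responses in the data of Example 26.1.

```python
from math import sqrt

k, n = 3, 12      # number of conditions and number of blocks in Example 26.1
T, R = 15, 27     # sum of row totals and sum of squared row totals

# Assumed form of the term under the radical (reproduces the cited .204).
radical = sqrt(2 * (k * T - R) / (n ** 2 * k * (k - 1)))

z_adj = 1.96      # unadjusted two-tailed .05 value cited in the endnote
critical_difference = z_adj * radical

print(f"radical term = {radical:.3f}")
print(f"critical difference = {critical_difference:.2f}")
```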


9. In point of fact, Equation 26.3 employs more information than the McNemar test, and thus
provides a more powerful test of an alternative hypothesis than the latter test (assuming 
both tests employ the same value for $\alpha_{PC}$). The lower power of the McNemar test is
directly attributed to the fact that for a given comparison, it only employs the scores of 
those subjects who obtain different scores under the two experimental conditions. 


10. In conducting the binomial sign test for two dependent samples, what is relevant is in 
which of the two conditions a subject has a higher score, which is commensurate with 
assigning a subject to one of two response categories. As is the case with the McNemar 
test and the Cochran Q test, the analysis for the binomial sign test for two dependent 
samples does not include subjects who obtain the same score in both conditions. 


11. The value $\chi^2 = 5.44$ is also obtained for Example 19.1 through use of Equation 8.2, which
is the equation for the chi-square goodness-of-fit test. In the case of Example 19.1 the 
latter equation produces an equivalent result to that obtained with Equation 19.3 (the 
normal approximation). The result of the binomial analysis of Example 19.1 with the chi- 
square goodness-of-fit test is summarized in Table 19.2. 


12. Within the framework of a time series design, one or more blocks can be included which 
can serve as controls. Specifically, in Example 26.6 additional cities might have been 
selected in which the gun control law was always in effect (i.e., in effect during Time 2 as 
well as during Times 1 and 3). Differences on the dependent variable during Time 2 
between the control cities and the cities in which the law was nullified between 1985-1989
could be contrasted to further evaluate the impact of the gun control law. Unfortunately, 
if the law in question is national, such control cities would not be available in the nation in 
which the study is conducted. The reader should note, however, that even if such control 
cities were available, the internal validity of such a study would still be subject to challenge, 
since it would still not ensure adequate control over extraneous variables. 


13. Related to the issue of employing an analysis of variance with a design such as that 
described by Example 26.6, Cochran (1950) and Winer et al. (1991) note that if a single- 
factor within-subjects analysis of variance is employed to evaluate the data in the 
Cochran Q test summary table (i.e., Table 26.1), it generally leads to similar conclusions 
as those reached when the data are evaluated with Equation 26.1. The question of whether 
it is appropriate to employ an analysis of variance to evaluate the categorical data in the 
Cochran Q test summary table is an issue on which researchers do not agree. 
Test 27


The Between-Subjects Factorial Analysis of Variance 
(Parametric Test Employed with Interval/Ratio Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


The between-subjects factorial analysis of variance is one of a number of analysis of variance 
procedures that are employed to evaluate a factorial design. A factorial design is employed to 
simultaneously evaluate the effect of two or more independent variables on a dependent variable. 
Each of the independent variables is referred to as a factor. Each of the factors has two or more 
levels, which refer to the number of groups/experimental conditions that comprise that indepen- 
dent variable. If a factorial design is not employed to assess the effect of multiple independent 
variables on a dependent variable, separate experiments must be conducted to evaluate the effect 
of each of the independent variables. One major advantage of a factorial design is that it allows 
the same set of hypotheses to be evaluated at a comparable level of power by using only a fraction 
of the subjects that would be required if separate experiments were conducted to evaluate the 
relevant hypotheses for each of the independent variables. Another advantage of a factorial design 
is that it permits a researcher to evaluate whether or not there is an interaction between two or 
more independent variables — the latter being something that cannot be determined if only one 
independent variable is employed in a study. An interaction is present in a set of data when the 
performance of subjects on one independent variable is not consistent across all the levels of 
another independent variable. The concept of interaction is discussed in detail in Section V. 

The between-subjects factorial analysis of variance (also known as a completely ran- 
domized factorial analysis of variance) is an extension of the single-factor between-subjects 
analysis of variance (Test 21) to experiments involving two or more independent variables. 
Although the between-subjects factorial analysis of variance can be used for more than two 
factors, the computational procedures described in this book will be limited to designs involving 
two factors. One of the factors will be designated by the letter A, and will have p levels, and the 
second factor will be designated by the letter B, and will have q levels. As a result of this, there 
will be a total of p x q groups. A p x q between-subjects/completely randomized factorial 
design requires that each of the p x q groups is comprised of different subjects who have been 
randomly assigned to that group. Each group serves under one of the p levels of Factor A and 
one of the q levels of Factor B, with no two groups serving under the same combination of levels 
of the two factors. All possible combinations of the levels of Factor A and Factor B are 
represented by the total p x q groups. 
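
The structure of a p x q design can be illustrated by enumerating its groups. In the sketch below the values p = 2 and q = 3 and the labels A1, B1, etc. are chosen purely for illustration.

```python
from itertools import product

p, q = 2, 3   # illustrative numbers of levels for Factor A and Factor B

levels_a = [f"A{j}" for j in range(1, p + 1)]
levels_b = [f"B{k}" for k in range(1, q + 1)]

# Every combination of one level of A with one level of B defines one group.
groups = [a + b for a, b in product(levels_a, levels_b)]

print(groups)
print(f"number of groups = {len(groups)}")
```

Each label appears exactly once, which reflects the requirement that no two groups serve under the same combination of levels.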

The between-subjects factorial analysis of variance evaluates the following hypotheses: 

a) With respect to Factor A: In the set of p independent samples (where p ≥ 2), do at least
two of the samples represent populations with different mean values? The latter hypothesis can 
also be stated as follows: Do at least two of the levels of Factor A represent populations with 
different mean values? 

b) With respect to Factor B: In the set of q independent samples (where q ≥ 2), do at least
two of the samples represent populations with different mean values? The latter hypothesis can
also be stated as follows: Do at least two of the levels of Factor B represent populations with
different mean values? 

c) In addition to evaluating the above hypotheses (which assess the presence or absence of 
what are referred to as main effects),¹ the between-subjects factorial analysis of variance
evaluates the hypothesis of whether there is a significant interaction between the two factors/ 
independent variables. 

A discussion of the theoretical rationale underlying the evaluation of the three sets of 
hypotheses for the between-subjects factorial analysis of variance can be found in Section VII. 

The between-subjects factorial analysis of variance is employed with interval/ratio data 
and is based on the following assumptions: a) Each sample has been randomly selected from the 
population it represents; b) The distribution of data in the underlying population from which each 
of the samples is derived is normal; and c) The third assumption, which is referred to as the 
homogeneity of variance assumption, states that the variances of the p x q underlying popu- 
lations represented by the p x q groups are equal to one another. The homogeneity of variance 
assumption (which is discussed earlier in the book in reference to the t test for two independent
samples (Test 11), the t test for two dependent samples (Test 17), the single-factor between-
subjects analysis of variance and the single-factor within-subjects analysis of variance (Test
24)) is discussed in greater detail in Section VI. If any of the aforementioned assumptions of 
the between-subjects factorial analysis of variance are saliently violated, the reliability of the 
computed test statistic may be compromised. 
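
As a purely descriptive illustration of the homogeneity of variance assumption, the sample variances of the p x q groups can be compared. The sketch below uses the scores of Example 27.1 (presented in Section II); it is not one of the formal tests of the assumption discussed in Section VI, and the variable names are illustrative.

```python
from statistics import variance

# Scores of Example 27.1, keyed by (level of Factor A, level of Factor B).
scores = {
    (1, 1): [11, 9, 10], (1, 2): [7, 8, 6], (1, 3): [5, 4, 3],
    (2, 1): [2, 4, 3],   (2, 2): [4, 5, 3], (2, 3): [0, 1, 2],
}

# Unbiased sample variance (n - 1 in the denominator) for each group.
cell_variances = {group: variance(values) for group, values in scores.items()}

# Ratio of the largest to the smallest variance; 1 indicates identical variances.
ratio = max(cell_variances.values()) / min(cell_variances.values())

print(cell_variances)
print(f"largest/smallest variance ratio = {ratio}")
```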


II. Example 


Example 27.1 A study is conducted to evaluate the effect of humidity (to be designated as 
Factor A) and temperature (to be designated as Factor B) on mechanical problem-solving 
ability. The experimenter employs a 2 x 3 between-subjects factorial design. The two levels that
comprise Factor A are $A_1$: Low humidity; $A_2$: High humidity. The three levels that comprise
Factor B are $B_1$: Low temperature; $B_2$: Moderate temperature; $B_3$: High temperature. The
study employs 18 subjects, each of whom is randomly assigned to one of the six experimental
groups (i.e., p x q = 2 x 3 = 6), resulting in three subjects per group. Each of the six
experimental groups represents a different combination of the levels that comprise the two
factors. The number of mechanical problems solved by the three subjects in each of the six
experimental conditions/groups follows. (The notation Group $AB_{jk}$ indicates the group that
served under Level j of Factor A and Level k of Factor B.) Group $AB_{11}$: Low humidity/Low
temperature (11, 9, 10); Group $AB_{12}$: Low humidity/Moderate temperature (7, 8, 6); Group
$AB_{13}$: Low humidity/High temperature (5, 4, 3); Group $AB_{21}$: High humidity/Low temperature
(2, 4, 3); Group $AB_{22}$: High humidity/Moderate temperature (4, 5, 3); Group $AB_{23}$: High
humidity/High temperature (0, 1, 2). Do the data indicate that either humidity or temperature
influences mechanical problem-solving ability? 
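
Before any formal analysis, the group means and the means of each level of the two factors can be computed directly from the scores listed above. The following is a descriptive sketch only; the helper name `level_mean` is illustrative.

```python
from statistics import mean

# Scores of Example 27.1, keyed by (level of Factor A, level of Factor B).
scores = {
    (1, 1): [11, 9, 10], (1, 2): [7, 8, 6], (1, 3): [5, 4, 3],
    (2, 1): [2, 4, 3],   (2, 2): [4, 5, 3], (2, 3): [0, 1, 2],
}

def level_mean(factor_index, level):
    """Mean of all scores obtained under one level of Factor A (index 0) or B (index 1)."""
    values = [x for group, xs in scores.items() if group[factor_index] == level for x in xs]
    return mean(values)

cell_means = {group: mean(xs) for group, xs in scores.items()}
level_a_means = {j: level_mean(0, j) for j in (1, 2)}
level_b_means = {k: level_mean(1, k) for k in (1, 2, 3)}
grand_mean = mean(x for xs in scores.values() for x in xs)

print("group means:", cell_means)
print("Factor A level means:", level_a_means)
print("Factor B level means:", level_b_means)
print(f"grand mean = {grand_mean:.2f}")
```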


III. Null versus Alternative Hypotheses


A between-subjects factorial analysis of variance involving two factors evaluates three sets
of hypotheses. The first set of hypotheses evaluates the effect of Factor A on the dependent
variable, the second set evaluates the effect of Factor B on the dependent variable, and the third
set evaluates whether or not there is an interaction between the two factors.
Set 1: Hypotheses for Factor A


Null hypothesis $H_0: \mu_{A_1} = \mu_{A_2}$
(The mean of the population Level 1 of Factor A represents equals the mean of the population
Level 2 of Factor A represents.)


Alternative hypothesis $H_1: \mu_{A_1} \neq \mu_{A_2}$
(The mean of the population Level 1 of Factor A represents does not equal the mean of the
population Level 2 of Factor A represents. This is a nondirectional alternative hypothesis. In
the discussion of the between-subjects factorial analysis of variance it will be assumed (unless
stated otherwise) that an alternative hypothesis is stated nondirectionally.² In order for the
alternative hypothesis for Factor A to be supported, the obtained F value for Factor A (designated
by the notation $F_A$) must be equal to or greater than the tabled critical F value at the prespecified
level of significance.)


Set 2: Hypotheses for Factor B 


Null hypothesis $H_0: \mu_{B_1} = \mu_{B_2} = \mu_{B_3}$


(The mean of the population Level 1 of Factor B represents equals the mean of the population
Level 2 of Factor B represents equals the mean of the population Level 3 of Factor B represents.)


Alternative hypothesis $H_1$: Not $H_0$


(This indicates that there is a difference between at least two of the q = 3 population means.
It is important to note that the alternative hypothesis should not be written as follows:
$H_1: \mu_{B_1} \neq \mu_{B_2} \neq \mu_{B_3}$. The latter notation for the alternative hypothesis is
incorrect because it implies that all three population means must differ from one another in
order to reject the null hypothesis. In order for the alternative hypothesis for Factor B to be
supported, the obtained F value for Factor B (designated by the notation $F_B$) must be equal to
or greater than the tabled critical F value at the prespecified level of significance.)


Set 3: Hypotheses for interaction 


$H_0$: There is no interaction between Factor A and Factor B.


$H_1$: There is an interaction between Factor A and Factor B.


Although it is possible to state the null and alternative hypotheses for the interaction 
symbolically, such a format will not be employed since it requires a considerable amount of 
notation. It should be noted that in predicting an interaction, a researcher may be very specific 
with respect to the pattern of the interaction that is predicted. As a general rule, however, such 
predictions are not reflected in the statement of the null and alternative hypotheses. In order for 
the alternative hypothesis for the interaction to be supported, the obtained F value for the 
interaction (designated by the notation $F_{AB}$) must be equal to or greater than the tabled critical
F value at the prespecified level of significance. 
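
Although the interaction hypotheses are stated verbally, the pattern an interaction produces can be illustrated descriptively with the group means of Example 27.1. In the sketch below the difference between the two levels of Factor A is computed separately at each level of Factor B; whether any departure from a constant difference is statistically significant is determined only by the F test for the interaction. The variable names are illustrative.

```python
from statistics import mean

# Scores of Example 27.1, keyed by (level of Factor A, level of Factor B).
scores = {
    (1, 1): [11, 9, 10], (1, 2): [7, 8, 6], (1, 3): [5, 4, 3],
    (2, 1): [2, 4, 3],   (2, 2): [4, 5, 3], (2, 3): [0, 1, 2],
}

cell_means = {group: mean(xs) for group, xs in scores.items()}

# Difference between Level 1 and Level 2 of Factor A at each level of Factor B.
differences = {k: cell_means[(1, k)] - cell_means[(2, k)] for k in (1, 2, 3)}

# If the differences are not all equal, the effect of A is not consistent across B.
pattern_suggests_interaction = len(set(differences.values())) > 1

print(differences)
print(pattern_suggests_interaction)
```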
IV. Test Computations


The test statistics for the between-subjects factorial analysis of variance can be computed with 
either computational or definitional equations. Although definitional equations reveal the 
underlying logic behind the analysis of variance, they involve considerably more calculations 
than do the computational equations. Because of the latter, computational equations will be 
employed in this section to demonstrate the computation of the test statistic. The definitional 
equations for the between-subjects factorial analysis of variance are described in Section VII. 

The data for Example 27.1 are summarized in Table 27.1. In the latter table the following 
notation is employed. 


$N$ represents the total number of subjects who serve in the experiment. In Example 27.1, $N = 18$.

$\Sigma X_T$ represents the total sum of the scores of the $N = 18$ subjects who serve in the experiment.

$\bar{X}_T$ represents the mean of the scores of the $N = 18$ subjects who serve in the experiment. $\bar{X}_T$ will be referred to as the grand mean.

$\Sigma X_T^2$ represents the total sum of the squared scores of the $N = 18$ subjects who serve in the experiment.

$(\Sigma X_T)^2$ represents the square of the total sum of scores of the $N = 18$ subjects who serve in the experiment.

$n_{AB_{jk}}$ represents the number of subjects who serve in Group $AB_{jk}$. In Example 27.1, $n_{AB_{11}} = n_{AB_{12}} = n_{AB_{13}} = n_{AB_{21}} = n_{AB_{22}} = n_{AB_{23}} = 3$. In some of the equations that follow, the notation $n$ is employed to represent the value $n_{AB_{jk}}$.

$\Sigma X_{AB_{jk}}$ represents the sum of the scores of the $n_{AB_{jk}} = 3$ subjects who serve in Group $AB_{jk}$.

$\bar{X}_{AB_{jk}}$ represents the mean of the scores of the $n_{AB_{jk}} = 3$ subjects who serve in Group $AB_{jk}$.

$\Sigma X_{AB_{jk}}^2$ represents the sum of the squared scores of the $n_{AB_{jk}} = 3$ subjects who serve in Group $AB_{jk}$.

$(\Sigma X_{AB_{jk}})^2$ represents the square of the sum of scores of the $n_{AB_{jk}} = 3$ subjects who serve in Group $AB_{jk}$.

$n_{A_j}$ represents the number of subjects who serve in Level j of Factor A. In Example 27.1, $n_{A_j} = (n_{AB_{jk}})(q) = (3)(3) = 9$.

$\Sigma X_{A_j}$ represents the sum of the scores of the $n_{A_j} = 9$ subjects who serve in Level j of Factor A.

$\bar{X}_{A_j}$ represents the mean of the scores of the $n_{A_j} = 9$ subjects who serve in Level j of Factor A.

$\Sigma X_{A_j}^2$ represents the sum of the squared scores of the $n_{A_j} = 9$ subjects who serve in Level j of Factor A.

$(\Sigma X_{A_j})^2$ represents the square of the sum of scores of the $n_{A_j} = 9$ subjects who serve in Level j of Factor A.

$n_{B_k}$ represents the number of subjects who serve in Level k of Factor B. In Example 27.1, $n_{B_k} = (n_{AB_{jk}})(p) = (3)(2) = 6$.

$\Sigma X_{B_k}$ represents the sum of the scores of the $n_{B_k} = 6$ subjects who serve in Level k of Factor B.

$\bar{X}_{B_k}$ represents the mean of the scores of the $n_{B_k} = 6$ subjects who serve in Level k of Factor B.

$\Sigma X_{B_k}^2$ represents the sum of the squared scores of the $n_{B_k} = 6$ subjects who serve in Level k of Factor B.

$(\Sigma X_{B_k})^2$ represents the square of the sum of scores of the $n_{B_k} = 6$ subjects who serve in Level k of Factor B.

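Table 27.1 Data for Example 27.1

Factor A (Humidity) is the row variable; Factor B (Temperature) is the column variable.

Level $A_1$ (Low)
                  Group $AB_{11}$      Group $AB_{12}$      Group $AB_{13}$      Row sums (Level $A_1$)
                  $X$      $X^2$       $X$      $X^2$       $X$      $X^2$
                  11       121         7        49          5        25
                  9        81          8        64          4        16
                  10       100         6        36          3        9
  $\Sigma X$      30                   21                   12                   $\Sigma X_{A_1} = 63$
  $\Sigma X^2$             302                  149                  50          $\Sigma X_{A_1}^2 = 501$
  $(\Sigma X)^2$  900                  441                  144                  $(\Sigma X_{A_1})^2 = 3969$
  $n$             3                    3                    3                    $n_{A_1} = 9$
  Mean            10                   7                    4                    $\bar{X}_{A_1} = 7$

Level $A_2$ (High)
                  Group $AB_{21}$      Group $AB_{22}$      Group $AB_{23}$      Row sums (Level $A_2$)
                  $X$      $X^2$       $X$      $X^2$       $X$      $X^2$
                  2        4           4        16          0        0
                  4        16          5        25          1        1
                  3        9           3        9           2        4
  $\Sigma X$      9                    12                   3                    $\Sigma X_{A_2} = 24$
  $\Sigma X^2$             29                   50                   5           $\Sigma X_{A_2}^2 = 84$
  $(\Sigma X)^2$  81                   144                  9                    $(\Sigma X_{A_2})^2 = 576$
  $n$             3                    3                    3                    $n_{A_2} = 9$
  Mean            3                    4                    1                    $\bar{X}_{A_2} = 2.67$

Column sums
                  Level $B_1$          Level $B_2$          Level $B_3$          Grand total
  $n$             6                    6                    6                    $N = 18$
  $\Sigma X$      39                   33                   15                   $\Sigma X_T = 87$
  Mean            6.5                  5.5                  2.5                  $\bar{X}_T = 4.83$
  $\Sigma X^2$    331                  199                  55                   $\Sigma X_T^2 = 585$
  $(\Sigma X)^2$  1521                 1089                 225                  $(\Sigma X_T)^2 = (87)^2 = 7569$
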
As is the case for the single-factor between-subjects analysis of variance, the total
variability for the between-subjects factorial analysis of variance can be divided into between-groups
variability and within-groups variability. The between-groups variability can be
divided into the following: a) Variability attributable to Factor A; b) Variability attributable to
Factor B; and c) Variability attributable to any interaction that is present between Factors A and
B (which will be designated as AB variability). For each of the variability components involved
in the between-subjects factorial analysis of variance, a sum of squares is computed. Thus,
the following sum of squares values are computed: a) $SS_T$, the total sum of squares; b) $SS_{BG}$,
the between-groups sum of squares; c) $SS_A$, the sum of squares for Factor A (which can also
be referred to as the row sum of squares, since in Table 27.1 Factor A is the row variable);
d) $SS_B$, the sum of squares for Factor B (which can also be referred to as the column sum of
squares, since in Table 27.1 Factor B is the column variable); e) $SS_{AB}$, the interaction sum of
squares; and f) $SS_{WG}$, the within-groups sum of squares, which is also referred to as the error
sum of squares or residual sum of squares, since it represents variability that is due to chance
factors which are beyond the control of the researcher. Each of the aforementioned sum of
squares values represents the numerator in the equation that is employed to compute the variance
for that variability component (which is referred to as the mean square for that component).

Equations 27.1-27.3 summarize the relationship between the sum of squares components
for the between-subjects factorial analysis of variance. Equation 27.1 summarizes the relationship
between the between-groups, the within-groups, and the total sums of squares.

$$SS_T = SS_{BG} + SS_{WG} \qquad \text{(Equation 27.1)}$$

Because of the relationship noted in Equation 27.2, Equation 27.1 can also be written in the
form of Equation 27.3.

$$SS_{BG} = SS_A + SS_B + SS_{AB} \qquad \text{(Equation 27.2)}$$

$$SS_T = SS_A + SS_B + SS_{AB} + SS_{WG} \qquad \text{(Equation 27.3)}$$


In order to compute the sums of squares for the between-subjects factorial analysis of
variance, the following summary values are computed with Equations 27.4-27.8 which will be
employed as elements in the computational equations: [XS], [T], [A], [B], [AB]. The reader
should take note of the fact that in the equations that follow, the following is true: a) $n_{AB_{jk}} = n = 3$;
b) $N = npq$, and thus, $N = (3)(2)(3) = 18$; c) $n_{A_j} = nq$, and thus, $n_{A_j} = (3)(3) = 9$;
d) $n_{B_k} = np$, and thus, $n_{B_k} = (3)(2) = 6$.

The summary value [XS] = 585 is computed with Equation 27.4.

$$[XS] = \Sigma X_T^2 = (11)^2 + (9)^2 + (10)^2 + \cdots + (0)^2 + (1)^2 + (2)^2 = 585 \qquad \text{(Equation 27.4)}$$

The summary value [T] = 420.5 is computed with Equation 27.5.

$$[T] = \frac{(\Sigma X_T)^2}{N} = \frac{(87)^2}{18} = 420.5 \qquad \text{(Equation 27.5)}$$

The summary value [A] = 505 is computed with Equation 27.6.

$$[A] = \sum_{j=1}^{p}\left[\frac{(\Sigma X_{A_j})^2}{n_{A_j}}\right] = \frac{(63)^2}{9} + \frac{(24)^2}{9} = 505 \qquad \text{(Equation 27.6)}$$


© 2000 by Chapman & Hall/CRC 


The notation $\sum_{j=1}^{p}[(\Sigma X_{A_j})^2 / n_{A_j}]$ in Equation 27.6 indicates that for each level of Factor A,
the scores of the $n_{A_j} = 9$ subjects who serve under that level of the factor are summed, the
resulting value is squared, and the obtained value is divided by $n_{A_j} = 9$. The values obtained
for each of the p = 2 levels of Factor A are then summed.

The summary value [B] = 472.5 is computed with Equation 27.7.

$$[B] = \sum_{k=1}^{q}\left[\frac{(\Sigma X_{B_k})^2}{n_{B_k}}\right] = \frac{(39)^2}{6} + \frac{(33)^2}{6} + \frac{(15)^2}{6} = 472.5 \qquad \text{(Equation 27.7)}$$

The notation $\sum_{k=1}^{q}[(\Sigma X_{B_k})^2 / n_{B_k}]$ in Equation 27.7 indicates that for each level of Factor B,
the scores of the $n_{B_k} = 6$ subjects who serve under that level of the factor are summed, the
resulting value is squared, and the obtained value is divided by $n_{B_k} = 6$. The values obtained
for each of the q = 3 levels of Factor B are then summed.

The summary value [AB] = 573 is computed with Equation 27.8.

$$[AB] = \sum_{k=1}^{q}\sum_{j=1}^{p}\left[\frac{(\Sigma X_{AB_{jk}})^2}{n_{AB_{jk}}}\right] = \frac{(30)^2}{3} + \frac{(21)^2}{3} + \frac{(12)^2}{3} + \frac{(9)^2}{3} + \frac{(12)^2}{3} + \frac{(3)^2}{3} = 573 \qquad \text{(Equation 27.8)}$$

The notation $\sum_{k=1}^{q}\sum_{j=1}^{p}[(\Sigma X_{AB_{jk}})^2 / n_{AB_{jk}}]$ in Equation 27.8 indicates that for each of the
pq = 6 groups, the scores of the $n_{AB_{jk}} = 3$ subjects who serve in that group are summed, the
resulting value is squared, and the obtained value is divided by $n_{AB_{jk}} = 3$. The values obtained
for each of the pq = 6 groups are then summed.
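The five summary values can be checked with a few lines of Python. The sketch below is illustrative rather than part of the original procedure: the variable and function names are hypothetical, and the individual scores are an assumption, filled in so that they agree with the group sums and sums of squares reported in Table 27.1.

```python
# Scores for Example 27.1, keyed by (level of Factor A, level of Factor B).
# Assumption: individual scores are reconstructed to agree with the group
# sums and sums of squares reported in Table 27.1.
scores = {
    (1, 1): [11, 9, 10], (1, 2): [7, 8, 6], (1, 3): [5, 4, 3],
    (2, 1): [2, 4, 3],   (2, 2): [4, 5, 3], (2, 3): [0, 1, 2],
}
p, q, n = 2, 3, 3          # levels of A, levels of B, subjects per group
N = n * p * q              # total number of subjects


def level_total(factor, level):
    """Sum of all scores at one level of a factor (factor 0 = A, 1 = B)."""
    return sum(sum(g) for key, g in scores.items() if key[factor] == level)


all_scores = [x for g in scores.values() for x in g]

XS = sum(x ** 2 for x in all_scores)                                 # Equation 27.4
T = sum(all_scores) ** 2 / N                                         # Equation 27.5
A = sum(level_total(0, j) ** 2 / (n * q) for j in range(1, p + 1))   # Equation 27.6
B = sum(level_total(1, k) ** 2 / (n * p) for k in range(1, q + 1))   # Equation 27.7
AB = sum(sum(g) ** 2 / n for g in scores.values())                   # Equation 27.8

print(XS, T, A, B, AB)  # 585 420.5 505.0 472.5 573.0
```

Because every group contains the same number of subjects, the divisors n, nq, and np are constants; with unequal group sizes each term would need its own divisor.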
Employing the summary values computed with Equations 27.4-27.8, Equations 27.9-27.14
can be employed to compute the values $SS_T$, $SS_{BG}$, $SS_A$, $SS_B$, $SS_{AB}$, and $SS_{WG}$.

Equation 27.9 is employed to compute the value $SS_T = 164.5$.

$$SS_T = [XS] - [T] = 585 - 420.5 = 164.5 \qquad \text{(Equation 27.9)}$$

Equation 27.10 is employed to compute the value $SS_{BG} = 152.5$.

$$SS_{BG} = [AB] - [T] = 573 - 420.5 = 152.5 \qquad \text{(Equation 27.10)}$$

Equation 27.11 is employed to compute the value $SS_A = 84.5$.

$$SS_A = [A] - [T] = 505 - 420.5 = 84.5 \qquad \text{(Equation 27.11)}$$

Equation 27.12 is employed to compute the value $SS_B = 52$.

$$SS_B = [B] - [T] = 472.5 - 420.5 = 52 \qquad \text{(Equation 27.12)}$$

Equation 27.13 is employed to compute the value $SS_{AB} = 16$.

$$SS_{AB} = [AB] - [A] - [B] + [T] = 573 - 505 - 472.5 + 420.5 = 16 \qquad \text{(Equation 27.13)}$$

Equation 27.14 is employed to compute the value $SS_{WG} = 12$.

$$SS_{WG} = [XS] - [AB] = 585 - 573 = 12 \qquad \text{(Equation 27.14)}$$

Note that $SS_{BG} = SS_A + SS_B + SS_{AB} = 84.5 + 52 + 16 = 152.5$ and $SS_T = SS_{BG} + SS_{WG} = 152.5 + 12 = 164.5$.

The reader should take note of the fact that the values $SS_T$, $SS_{BG}$, $SS_A$, $SS_B$, $SS_{AB}$, and $SS_{WG}$
can never be negative numbers. If a negative value is obtained for any of the aforementioned
values, it indicates a computational error has been made.
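The arithmetic of Equations 27.9-27.14, together with the two checks just described, can be sketched in Python as follows. The variable names are illustrative only; the numerical inputs are the summary values reported above for Example 27.1.

```python
# Summary values from Equations 27.4-27.8 (Example 27.1).
XS, T, A, B, AB = 585, 420.5, 505, 472.5, 573

SS_T = XS - T              # Equation 27.9
SS_BG = AB - T             # Equation 27.10
SS_A = A - T               # Equation 27.11
SS_B = B - T               # Equation 27.12
SS_AB = AB - A - B + T     # Equation 27.13
SS_WG = XS - AB            # Equation 27.14

components = {"T": SS_T, "BG": SS_BG, "A": SS_A,
              "B": SS_B, "AB": SS_AB, "WG": SS_WG}

# A negative sum of squares signals a computational error.
assert all(value >= 0 for value in components.values())

# Partition checks (Equations 27.2 and 27.1).
assert abs(SS_BG - (SS_A + SS_B + SS_AB)) < 1e-9
assert abs(SS_T - (SS_BG + SS_WG)) < 1e-9

print(components)
```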

At this point the mean square values (which as previously noted represent variances) for
the above components can be computed. In order to compute the test statistics for the between-subjects
factorial analysis of variance, it is only required that the following mean square values
be computed: $MS_A$, $MS_B$, $MS_{AB}$, and $MS_{WG}$.

$MS_A$ is computed with Equation 27.15.

$$MS_A = \frac{SS_A}{df_A} \qquad \text{(Equation 27.15)}$$

$MS_B$ is computed with Equation 27.16.

$$MS_B = \frac{SS_B}{df_B} \qquad \text{(Equation 27.16)}$$

$MS_{AB}$ is computed with Equation 27.17.

$$MS_{AB} = \frac{SS_{AB}}{df_{AB}} \qquad \text{(Equation 27.17)}$$

$MS_{WG}$ is computed with Equation 27.18.

$$MS_{WG} = \frac{SS_{WG}}{df_{WG}} \qquad \text{(Equation 27.18)}$$

In order to compute $MS_A$, $MS_B$, $MS_{AB}$, and $MS_{WG}$, it is required that the values $df_A$,
$df_B$, $df_{AB}$, and $df_{WG}$ (the denominators of Equations 27.15-27.18) be computed.

$df_A$ are computed with Equation 27.19.

$$df_A = p - 1 \qquad \text{(Equation 27.19)}$$

$df_B$ are computed with Equation 27.20.

$$df_B = q - 1 \qquad \text{(Equation 27.20)}$$

$df_{AB}$ are computed with Equation 27.21.

$$df_{AB} = (p - 1)(q - 1) \qquad \text{(Equation 27.21)}$$

$df_{WG}$ are computed with Equation 27.22. As noted earlier, the value $n$ is equivalent to the
value $n_{AB_{jk}}$. The use of $n$ in any of the equations for the between-subjects factorial analysis of
variance assumes that there are an equal number of subjects in each of the pq groups.

$$df_{WG} = pq(n - 1) \qquad \text{(Equation 27.22)}$$


Although they are not required in order to compute the F ratios for the between-subjects
factorial analysis of variance, the between-groups degrees of freedom ($df_{BG}$) and the total
degrees of freedom ($df_T$) are generally computed, since they can be used to confirm the df
values computed with Equations 27.19-27.22, and since they are employed in the
analysis of variance summary table.

$df_{BG}$ are computed with Equation 27.23.

$$df_{BG} = pq - 1 \qquad \text{(Equation 27.23)}$$

$df_T$ are computed with Equation 27.24.

$$df_T = N - 1 \qquad \text{(Equation 27.24)}$$

The relationships between the various degrees of freedom values are described below.

$$df_{BG} = df_A + df_B + df_{AB} \qquad df_T = df_{BG} + df_{WG}$$

Employing Equations 27.19-27.24, the values $df_A = 1$, $df_B = 2$, $df_{AB} = 2$, $df_{WG} = 12$,
$df_{BG} = 5$, and $df_T = 17$ are computed.

$$df_A = 2 - 1 = 1 \qquad df_B = 3 - 1 = 2 \qquad df_{AB} = (2 - 1)(3 - 1) = 2$$

$$df_{WG} = (2)(3)(3 - 1) = 12 \qquad df_{BG} = [(2)(3)] - 1 = 5 \qquad df_T = 18 - 1 = 17$$

Note that $df_{BG} = df_A + df_B + df_{AB} = 1 + 2 + 2 = 5$ and $df_T = df_{BG} + df_{WG} = 5 + 12 = 17$.
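Employing Equations 27.15-27.18, the following values are computed: $MS_A = 84.5$,
$MS_B = 26$, $MS_{AB} = 8$, $MS_{WG} = 1$.

$$MS_A = \frac{84.5}{1} = 84.5 \qquad MS_B = \frac{52}{2} = 26 \qquad MS_{AB} = \frac{16}{2} = 8 \qquad MS_{WG} = \frac{12}{12} = 1$$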


The F ratio is the test statistic for the between-subjects factorial analysis of variance. 
Since, however, there are three sets of hypotheses to be evaluated, it is required that three F ratios 
be computed — one for each of the components that comprise the between-groups variability. 
Specifically, an F ratio is computed for Factor A, for Factor B, and for the AB interaction. 
Equations 27.25-27.27 are, respectively, employed to compute the three F ratios.

$$F_A = \frac{MS_A}{MS_{WG}} \qquad \text{(Equation 27.25)}$$

$$F_B = \frac{MS_B}{MS_{WG}} \qquad \text{(Equation 27.26)}$$

$$F_{AB} = \frac{MS_{AB}}{MS_{WG}} \qquad \text{(Equation 27.27)}$$

Employing Equations 27.25-27.27, the values $F_A = 84.5$, $F_B = 26$, and $F_{AB} = 8$ are
computed.

$$F_A = \frac{84.5}{1} = 84.5 \qquad F_B = \frac{26}{1} = 26 \qquad F_{AB} = \frac{8}{1} = 8$$



The reader should take note of the fact that any value computed for a mean square or an
F ratio can never be a negative number. If a negative value is obtained for any mean square
or F ratio, it indicates a computational error has been made. If $MS_{WG} = 0$, Equations 27.25-27.27
will be insoluble. The only time $MS_{WG} = 0$ is when, within each of the pq groups, all
subjects obtain the same score (i.e., there is no within-groups variability). If the mean values for
all of the levels of any factor are identical, the mean square value for that factor will equal zero,
and, if the latter is true, the F value for that factor will also equal zero.
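The degrees of freedom, mean squares, and F ratios for Example 27.1 can be reproduced with the short Python sketch below. The dictionary layout and the helper name `f_ratio` are illustrative choices, not notation from the text; the sums of squares are the values computed with Equations 27.11-27.14.

```python
p, q, n = 2, 3, 3          # levels of A, levels of B, subjects per group
N = n * p * q

SS = {"A": 84.5, "B": 52.0, "AB": 16.0, "WG": 12.0}
df = {
    "A": p - 1,                  # Equation 27.19
    "B": q - 1,                  # Equation 27.20
    "AB": (p - 1) * (q - 1),     # Equation 27.21
    "WG": p * q * (n - 1),       # Equation 27.22
}
df_BG = p * q - 1                # Equation 27.23
df_T = N - 1                     # Equation 27.24

# Consistency checks on the degrees of freedom.
assert df_BG == df["A"] + df["B"] + df["AB"]
assert df_T == df_BG + df["WG"]

MS = {source: SS[source] / df[source] for source in SS}   # Equations 27.15-27.18


def f_ratio(ms_effect, ms_wg):
    """F ratio for an effect; undefined when there is no within-groups variability."""
    if ms_wg == 0:
        raise ZeroDivisionError("MS_WG = 0: the F ratio cannot be computed")
    return ms_effect / ms_wg


F = {source: f_ratio(MS[source], MS["WG"]) for source in ("A", "B", "AB")}
print(df, MS, F)
```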


V. Interpretation of the Test Results 


It is common practice to summarize the results of a between-subjects factorial analysis of 
variance with the summary table represented by Table 27.2. 


Table 27.2 Summary Table of Analysis of Variance for Example 27.1

Source of variation      SS        df      MS       F
Between-groups           152.5     5
   A                     84.5      1       84.5     84.5
   B                     52        2       26       26
   AB                    16        2       8        8
Within-groups            12        12      1
Total                    164.5     17


The obtained F values are evaluated with Table A10 (Table of the F Distribution) in the
Appendix. In Table A10 critical values are listed in reference to the number of degrees of
freedom associated with the numerator and the denominator of an F ratio. Thus, in the case of
Example 27.1 the values for $df_A$, $df_B$, and $df_{AB}$ are employed for the numerator degrees of
freedom for each of the three F ratios, while $df_{WG}$ is employed as the denominator degrees of
freedom for all three F ratios. As is the case in the discussion of other analysis of variance
procedures discussed in the book, the notation $F_{.05}$ is employed to represent the tabled critical
F value at the .05 level. The latter value corresponds to the tabled $F_{.95}$ value in Table A10. In
the same respect, the notation $F_{.01}$ will be employed to represent the tabled critical F value at
the .01 level, and the latter value will correspond to the relevant tabled $F_{.99}$ value in Table A10.

The following tabled critical values are employed in evaluating the three F ratios computed
for Example 27.1: a) Factor A: For $df_{num} = df_A = 1$ and $df_{den} = df_{WG} = 12$, $F_{.05} = 4.75$ and
$F_{.01} = 9.33$; b) Factor B: For $df_{num} = df_B = 2$ and $df_{den} = df_{WG} = 12$, $F_{.05} = 3.89$ and
$F_{.01} = 6.93$; c) AB interaction: For $df_{num} = df_{AB} = 2$ and $df_{den} = df_{WG} = 12$, $F_{.05} = 3.89$
and $F_{.01} = 6.93$.

In order to reject the null hypothesis in reference to a computed F ratio, the obtained F value
must be equal to or greater than the tabled critical value at the prespecified level of significance.
Since the computed value $F_A = 84.5$ is greater than $F_{.05} = 4.75$ and $F_{.01} = 9.33$, the alternative
hypothesis for Factor A is supported at both the .05 and .01 levels. Since the computed
value $F_B = 26$ is greater than $F_{.05} = 3.89$ and $F_{.01} = 6.93$, the alternative hypothesis for Factor
B is supported at both the .05 and .01 levels. Since the computed value $F_{AB} = 8$ is greater
than $F_{.05} = 3.89$ and $F_{.01} = 6.93$, the alternative hypothesis for an interaction between Factors
A and B is supported at both the .05 and .01 levels. The aforementioned results can be
summarized as follows: $F_A(1,12) = 84.5$, $p < .01$; $F_B(2,12) = 26$, $p < .01$;
$F_{AB}(2,12) = 8$, $p < .01$.
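The decision rule can be expressed compactly in Python. In the sketch below the critical values are simply the tabled values quoted above for Example 27.1 (they are not computed from the F distribution), and the names `CRITICAL_F` and `significant_at` are hypothetical.

```python
# Tabled critical F values quoted in the text (Table A10),
# keyed by (numerator df, denominator df).
CRITICAL_F = {
    (1, 12): {0.05: 4.75, 0.01: 9.33},
    (2, 12): {0.05: 3.89, 0.01: 6.93},
}

# Obtained F ratios for Example 27.1: (F, df_num, df_den).
obtained = {"A": (84.5, 1, 12), "B": (26.0, 2, 12), "AB": (8.0, 2, 12)}


def significant_at(f_value, df_num, df_den):
    """Alpha levels at which the obtained F equals or exceeds the tabled value."""
    table = CRITICAL_F[(df_num, df_den)]
    return sorted(alpha for alpha, critical in table.items() if f_value >= critical)


decisions = {effect: significant_at(*values) for effect, values in obtained.items()}
print(decisions)
```

An empty list for an effect would mean that the null hypothesis for that effect is retained at both levels.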

The analysis of the data for Example 27.1 allows the researcher to conclude that both 
humidity (Factor A) and temperature (Factor B) have a significant impact on problem-solving 
scores. Thus, both main effects are significant. As previously noted, a main effect describes the 
effect of one factor/independent variable on the dependent variable, ignoring any effect any of 
the other factors/independent variables might have on the dependent variable. There is also, 
however, a significant interaction present in the data. As noted in Section I, the latter indicates 
that the effect of one factor is not consistent across all the levels of the other factor. It is 
important to note that the presence of an interaction renders any significant main effects 
meaningless, since it requires that the relationship described by a main effect be qualified. This 
is the case, since when an interaction is present, the nature of the relationship between the levels 
of a factor on which a significant main effect is detected will depend upon which level of the 
second factor is considered. Table 27.3, which summarizes the data for Example 27.1, will be
used to illustrate this point. The six cells in Table 27.3 contain the means of the pq = 6
groups. The values in the margins of the rows and columns of the table, respectively, represent
the means of the levels of Factor A and Factor B. In Table 27.3 the average of any row or
column can be obtained by adding all of the values in that row or column, and dividing the sum
by the number of cells in that row or column.


Table 27.3 Group and Marginal Means for Example 27.1

                                      Factor B (Temperature)
                               $B_1$        $B_2$          $B_3$        Row
                               (Low)        (Moderate)     (High)       averages
Factor A       $A_1$ (Low)     10           7              4            7
(Humidity)     $A_2$ (High)    3            4              1            2.67
Column averages                6.5          5.5            2.5          Grand mean = 4.83


In Table 27.3 the main effect for Factor A (Humidity) indicates that as humidity increases
the number of problems solved decreases (since $(\bar{X}_{A_1} = 7) > (\bar{X}_{A_2} = 2.67)$). Similarly, the
main effect for Factor B (Temperature) indicates that as temperature increases, the number of
problems solved decreases (since $(\bar{X}_{B_1} = 6.5) > (\bar{X}_{B_2} = 5.5) > (\bar{X}_{B_3} = 2.5)$). However,
closer inspection of the data reveals that the effects of the factors on the dependent variable are 
not as straightforward as the main effects suggest. Specifically, the ordinal relationship depicted 
for the main effect on Factor B is only applicable to Level 1 of Factor A. Although under the low 
humidity condition (A, ) the number of problems solved decreases as temperature increases, the 
latter is not true for the high humidity condition (A, ). Under the latter condition the number of 
problems solved increases from 3 to 4 as temperature increases from low to moderate but then 
decreases to 1 under the high temperature condition. Thus, the main effect for Factor B is mis- 
leading, since it is based on the result of averaging the data from two rows which do not contain 
consistent patterns of information. In the same respect, if one examines the main effect on Factor 
A, it suggests that as humidity increases, performance decreases. Table 27.3, however, reveals 
that although this ordinal relationship is observed for all three levels of Factor B, the effect is
much more pronounced for Level 1 (low temperature) than it is for either Level 2 (moderate 
temperature) or Level 3 (high temperature). Thus, even though the ordinal relationship described 
by the main effect is consistent across the three levels of Factor B, the magnitude of the 
relationship varies depending upon which level of Factor B is considered. 
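The pattern described above can be made concrete by computing, from the group means in Table 27.3, the marginal means and the effect of each factor within each level of the other. The following Python sketch is purely descriptive (it performs no significance test), and its variable names are illustrative.

```python
# Group means from Table 27.3, indexed [level of Factor A][level of Factor B].
cell_means = [[10, 7, 4],   # A1 (low humidity): B1, B2, B3
              [3, 4, 1]]    # A2 (high humidity): B1, B2, B3

# Marginal means (equal group sizes, so unweighted averages are appropriate).
row_means = [sum(row) / len(row) for row in cell_means]
col_means = [sum(col) / len(col) for col in zip(*cell_means)]

# Effect of humidity (A1 minus A2) at each temperature level.
humidity_effect = [a1 - a2 for a1, a2 in zip(*cell_means)]

# Change in the mean from one temperature level to the next, within each humidity level.
temperature_steps = [[b - a for a, b in zip(row, row[1:])] for row in cell_means]

print(row_means)          # marginal means for A1, A2
print(col_means)          # marginal means for B1, B2, B3
print(humidity_effect)    # [7, 3, 3]
print(temperature_steps)  # [[-3, -3], [1, -3]]
```

The unequal humidity effects (7 versus 3 and 3) and the differing temperature profiles in the two rows are the descriptive signature of the interaction; whether it is significant is determined by the F ratio for the interaction.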

Figure 27.1 summarizes the information presented in Table 27.3 in a graphical format. 
Each of the points depicted in the graphs described by Figures 27.1a and 27.1b represents the 
average score of the group that corresponds to the level of the factor represented by the line on 
which that point falls and the level of the factor on the abscissa (X-axis) above which the point 
falls. An interaction is revealed on either graph when two or more of the lines are not equidistant 
from one another throughout the full length of the graph, as one moves from left to right. When 
two or more lines on a graph intersect with one another, as is the case in Figure 27.1a, or two or 
more lines diverge from one another, as is the case in Figure 27.1b, it more than likely indicates 
the presence of an interaction. The ultimate determination, however, with respect to whether or
not a significant interaction is present should always be based on the computed value of the $F_{AB}$
ratio.

Figure 27.1 Graphical Summary of Results of Example 27.1


In an experiment in which there are two factors, either of two graphs can be employed 
to summarize the results of the study. In Figure 27.1a the levels of Factor A are represented 
on the abscissa, and three lines are employed to represent subjects’ performance on each of 
the levels of Factor B (with reference to the specific levels of Factor A). In Figure 27.1b the 
levels of Factor B are represented on the abscissa, and two lines are employed to represent 
subjects’ performance on each of the levels of Factor A (with reference to the specific levels of 
Factor B). As noted earlier, the fact that an interaction is present is reflected in Figures 27.1a
and 27.1b, since the lines are not equidistant from one another throughout the length of both
graphs.

Table 27.4 and Figure 27.2 summarize a hypothetical set of data (for the same experiment
described by Example 27.1) in which no interaction is present. For purposes of illustration it will
be assumed that in this example the computed values $F_A$ and $F_B$ are significant, while $F_{AB}$ is not.

Table 27.4 Hypothetical Values for Group and Marginal Means
When There Is No Interaction

                                      Factor B (Temperature)
                               $B_1$        $B_2$          $B_3$        Row
                               (Low)        (Moderate)     (High)       averages
Factor A       $A_1$ (Low)     10           8              6            8
(Humidity)     $A_2$ (High)    6            4              2            4
Column averages                8            6              4            Grand mean = 6

Figure 27.2 Graphical Summary of Results Described in Table 27.4


Inspection of Table 27.4 and Figure 27.2 indicates the presence of a main effect on both
Factors A and B and the absence of an interaction. The presence of a main effect on Factor A
is reflected in the fact that there is a reasonably large difference between $\bar{X}_{A_1} = 8$ and $\bar{X}_{A_2} = 4$.
In the same respect, the significant main effect on Factor B is reflected in the discrepancy
between the mean values $\bar{X}_{B_1} = 8$, $\bar{X}_{B_2} = 6$, and $\bar{X}_{B_3} = 4$. The conclusion that there is no
interaction is based on the fact that the relationship described by each of the main effects is 
consistent across all of the levels of the second factor. To illustrate this, consider the main effect 
described for Factor A. In Table 27.4, the main effect for Factor A indicates that subjects solve 
4 more problems under the low humidity condition than under the high humidity condition, and 
since this is the case regardless of which level of Factor B one considers, it indicates that there 
is no interaction between the two factors. The absence of an interaction is reflected in Figure 
27.2a, since the three lines are equidistant from one another. In Table 27.4 the main effect for 
Factor B indicates that the number of problems solved decreases in steps of 2 as one progresses 
from low to moderate to high temperature. This pattern is consistent across both of the levels of 
Factor A. The absence of an interaction is also reflected in Figure 27.2b, since the two lines
representing each of the levels of Factor A are equidistant from one another (as well as being
parallel) throughout the length of the graph.

The term additive model is commonly employed to describe an analysis of variance in
which there is no interaction (whereas the term nonadditive is employed when there is an
interaction). The use of the term additive within this context reflects the fact that the mean of any
of the p x q cells can be obtained by adding the row and column effects for that cell to the grand
mean (Myers and Well, 1995).
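The additive property can be verified numerically. In the Python sketch below, a row (or column) effect is taken to be the difference between that row's (or column's) marginal mean and the grand mean; that definition, the equal-group-size assumption behind the unweighted averages, and the function names are assumptions made for illustration.

```python
def additive_prediction(cell_means):
    """Predict each cell mean as grand mean + row effect + column effect."""
    rows, cols = len(cell_means), len(cell_means[0])
    grand = sum(sum(row) for row in cell_means) / (rows * cols)
    row_effects = [sum(row) / cols - grand for row in cell_means]
    col_effects = [sum(col) / rows - grand for col in zip(*cell_means)]
    return [[grand + r + c for c in col_effects] for r in row_effects]


def is_additive(cell_means, tol=1e-9):
    """True when every observed cell mean matches its additive prediction."""
    predicted = additive_prediction(cell_means)
    return all(
        abs(observed - expected) <= tol
        for obs_row, exp_row in zip(cell_means, predicted)
        for observed, expected in zip(obs_row, exp_row)
    )


table_27_4 = [[10, 8, 6], [6, 4, 2]]   # hypothetical means, no interaction
table_27_3 = [[10, 7, 4], [3, 4, 1]]   # means for Example 27.1

print(is_additive(table_27_4))  # True
print(is_additive(table_27_3))  # False
```

For Table 27.4, for instance, the cell for low humidity and low temperature is recovered as 6 + 2 + 2 = 10, whereas for Table 27.3 the same calculation gives approximately 8.67 rather than the observed 10.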


VI. Additional Analytical Procedures for the Between-Subjects 
Factorial Analysis of Variance and/or Related Tests 


1. Comparisons following the computation of the F values for the between-subjects 
factorial analysis of variance Upon computing the omnibus F values, further analysis of the 
data comparing one or more groups and/or factor levels with one another can provide a 
researcher with more detailed information regarding the relationship between the independent 
variables and the dependent variable. Since the procedures to be described in this section are 
essentially extensions of those employed for the single-factor between-subjects analysis of 
variance, the reader should review the discussion of comparison procedures in Section VI of the 
latter test before proceeding. The discussion in this section will examine additional analytical 
procedures that can be conducted following the computation of the F values under the following 
three conditions: a) No significant main effects or interaction are present; b) One or both 
main effects are significant, but the interaction is not significant; and c) A significant 
interaction is present, with or without one or more of the main effects being significant. 
Table 27.5 is a summary table for Example 27.1 depicting all of the group means for which 
comparison procedures will be described in this section. 


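Table 27.5 Summary Table of Means for Example 27.1

                                  Factor B (Temperature)
                        $B_1$                  $B_2$                  $B_3$                  Row averages
Factor A      $A_1$     $\bar{X}_{AB_{11}}$    $\bar{X}_{AB_{12}}$    $\bar{X}_{AB_{13}}$    $\bar{X}_{A_1}$
(Humidity)    $A_2$     $\bar{X}_{AB_{21}}$    $\bar{X}_{AB_{22}}$    $\bar{X}_{AB_{23}}$    $\bar{X}_{A_2}$
Column averages         $\bar{X}_{B_1}$        $\bar{X}_{B_2}$        $\bar{X}_{B_3}$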


a) No significant main effects or interaction are present If in a between-subjects 
factorial analysis of variance neither of the main effects or interaction is significant, in most 
instances it will not be productive for a researcher to conduct additional analysis of the data. If, 
however, prior to the data collection phase of a study a researcher happens to have planned any 
of the specific types of analyses to be discussed later in this section, he can still conduct them
regardless of whether or not any of the F values are significant (and not be obliged to control the
value of $\alpha_{FW}$). Although one can also justify conducting additional analytical procedures that
are unplanned, in such a case most statisticians believe that a researcher should control the
familywise Type I error rate ($\alpha_{FW}$), in order that it not exceed what would be considered to be
a reasonable level.

b) One or both main effects are significant, but the interaction is not significant When 
at least one of the F values is significant, the first question the researcher must ask prior to 
conducting any additional analytical procedures is whether or not the interaction is significant. 
When the interaction is not significant, a factorial design can essentially be conceptualized as 
being comprised of two separate single factor experiments. As such, both simple and complex 
comparisons can be conducted contrasting different means or sets of means that represent the 
levels of each of the factors. Such comparisons involve contrasting within a specific factor the
marginal means (i.e., the means of the p rows and the means of the q columns). In the case of
Example 27.1, a simple comparison can be conducted in which two of the three levels of Factor
B are compared with one another (i.e., $\bar{X}_{B_1}$ versus $\bar{X}_{B_2}$), or a complex comparison in which a
composite mean involving two levels of Factor B is compared with the mean of the third level
of Factor B (i.e., $(\bar{X}_{B_1} + \bar{X}_{B_2})/2$ versus $\bar{X}_{B_3}$). If Factor B has four levels, a complex comparison
contrasting two sets of composite means (each set representing a composite mean of two of the
four levels) can be conducted (i.e., $(\bar{X}_{B_1} + \bar{X}_{B_2})/2$ versus $(\bar{X}_{B_3} + \bar{X}_{B_4})/2$). Since there are only
two levels of Factor A, no additional comparisons are possible involving the means of the levels
of that factor (i.e., the omnibus F value for Factor A represents the comparison $\bar{X}_{A_1}$ versus $\bar{X}_{A_2}$).
As is the case for the single-factor between-subjects analysis of variance, in designs in which
one or both of the factors are comprised of more than three levels, it is possible to conduct an
omnibus F test comparing the means of three or more of the levels of a specific factor. In
addition to all of the aforementioned comparisons, within a given level of a specific factor,
simple and complex comparisons can be conducted that contrast the means of specific groups that
are a combination of both factors (i.e., a simple comparison such as $\bar{X}_{AB_{11}}$ versus $\bar{X}_{AB_{12}}$, or a
complex comparison such as $\bar{X}_{AB_{11}}$ versus $(\bar{X}_{AB_{12}} + \bar{X}_{AB_{13}})/2$). It is worth reiterating that,
whenever possible, comparisons should be planned prior to the data collection phase of a study, 
and that any comparisons which are conducted should address important theoretical and/or 
practical questions that underlie the hypotheses under study. In addition, the total number of 
comparisons that are conducted should be limited in number, and should not be redundant with 
respect to the information they provide. 

c) A significant interaction is present with or without one or more of the main effects
being significant As noted previously, when the interaction is significant the main effects are
essentially rendered meaningless, since any main effects will have to be qualified in reference
to the levels of a second factor. Thus, any comparison that involves the levels of a specific factor
(e.g., $\bar{X}_{B_1}$ versus $\bar{X}_{B_2}$) will reflect both the contribution of that factor, as well as the interaction
between that factor and the second factor. For this reason, the most logical strategy to employ
if a significant interaction is obtained is to test for what are referred to as simple effects. A test
of a simple effect is essentially an analysis of variance evaluating all of the levels of one factor
across only one level of the other factor. In the case of Example 27.1, two simple effects can be
evaluated for Factor B. Specifically, an F test can be conducted which evaluates the scores of
subjects on Factor B, but only for those subjects who serve under Level 1 of Factor A (i.e., an
F ratio is computed for Groups $AB_{11}$, $AB_{12}$, and $AB_{13}$). A second simple effect for Factor B
can be evaluated by contrasting the scores of subjects on Factor B, but only for those subjects
who serve under Level 2 of Factor A (i.e., Groups $AB_{21}$, $AB_{22}$, and $AB_{23}$). In the case of
Factor A, there are three possible simple effects that can be evaluated. Specifically, separate F
tests can be conducted which evaluate the scores of subjects on Factor A for only those subjects
who serve under: a) Level 1 of Factor B (i.e., Groups $AB_{11}$ and $AB_{21}$); b) Level 2 of Factor B
(i.e., Groups $AB_{12}$ and $AB_{22}$); and c) Level 3 of Factor B (i.e., Groups $AB_{13}$ and $AB_{23}$). In
the event that one or more of the simple effects are significant, additional simple and complex
comparisons contrasting specific groups within a given level of a factor can be conducted (e.g.,
a simple comparison such as $\bar{X}_{AB_{11}}$ versus $\bar{X}_{AB_{12}}$, or a complex comparison such as $\bar{X}_{AB_{11}}$
versus $(\bar{X}_{AB_{12}} + \bar{X}_{AB_{13}})/2$).
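As a rough illustration of what a simple-effect analysis involves, the following Python sketch computes a sum of squares for Factor B separately within each level of Factor A, using the group sums for Example 27.1. It is a sketch under stated assumptions rather than the procedure as the text presents it: the function name is hypothetical, the computational form mirrors the summary-value approach of Section IV, and dividing by the pooled $MS_{WG} = 1$ from the omnibus analysis is an assumption about the error term.

```python
# Group sums for Example 27.1, keyed by (level of Factor A, level of Factor B).
group_sums = {(1, 1): 30, (1, 2): 21, (1, 3): 12,
              (2, 1): 9,  (2, 2): 12, (2, 3): 3}
n, q = 3, 3       # subjects per group, levels of Factor B
MS_WG = 1.0       # assumption: pooled error term from the omnibus analysis


def simple_effect_of_b(a_level):
    """Return (SS, df, F) for Factor B within a single level of Factor A."""
    sums = [group_sums[(a_level, k)] for k in range(1, q + 1)]
    # Between-groups sum of squares restricted to the q groups in this row.
    ss = sum(s ** 2 / n for s in sums) - sum(sums) ** 2 / (n * q)
    df = q - 1
    return ss, df, (ss / df) / MS_WG


ss_a1, df_a1, f_a1 = simple_effect_of_b(1)
ss_a2, df_a2, f_a2 = simple_effect_of_b(2)
print(ss_a1, f_a1)   # 54.0 27.0
print(ss_a2, f_a2)   # 14.0 7.0
```

One useful check on such a computation is that the two simple-effect sums of squares add up to 54 + 14 = 68, which equals $SS_B + SS_{AB} = 52 + 16$.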

Description of analytical procedures (Including the following comparison procedures that are
described for the single-factor between-subjects analysis of variance, which in this section,
are described in reference to the between-subjects factorial analysis of variance: Test 27a:
Multiple t tests/Fisher’s LSD test (which is equivalent to linear contrasts); Test 27b: The
Bonferroni-Dunn test; Test 27c: Tukey's HSD test; Test 27d: The Newman-Keuls test;
Test 27e: The Scheffé test; Test 27f: The Dunnett test)


Comparisons between the marginal means The equations that are employed in conducting 
simple and complex comparisons involving the marginal means are basically the same equations 
that are employed for conducting comparisons for the single-factor between-subjects analysis 
of variance. Thus, in comparing two marginal means or two sets of marginal means (in the case 
of complex comparisons), linear contrasts can be conducted when no attempt is made to control
the value of $\alpha_{FW}$ (which will generally be the case for planned comparisons). In the case of
either planned or unplanned comparisons where the value of $\alpha_{FW}$ is controlled, any of the
multiple comparison procedures discussed under the single-factor between-subjects analysis 
of variance can be employed (i.e., The Bonferroni-Dunn test, Tukey's HSD test, The 
Newman-Keuls test, The Scheffé test, and The Dunnett test). The only difference in 
employing any of the latter comparison procedures with a factorial design is that the sample size 
employed in a comparison equation will reflect the number of subjects in each of the levels of 
the relevant factor. Thus, any comparison involving the marginal means of Factor A will involve 
nq subjects per group (in Example 27.1, nq = (3)(3) = 9), and any comparison involving the 
marginal means of Factor B will involve np subjects per group (in Example 27.1, np = (3)(2) = 
6). 

As an example, assume we want to compare the scores of subjects on two of the levels of
Factor B, specifically Level 1 versus Level 3 (i.e., $\bar{X}_{B_1}$ versus $\bar{X}_{B_3}$). If no attempt is made to
control the value of $\alpha_{FW}$, Equations 27.28-27.30 (which are the analogs of Equations
21.17-21.19 employed for conducting linear contrasts for the single-factor between-subjects
analysis of variance) are employed to conduct a linear contrast comparing the two levels of
Factor B (which within the framework of the comparison are conceptualized as two groups).
Note that in Equation 27.28 the value np represents the number of subjects who served under
each level of Factor B, and $[\Sigma (c_{B_k})(\bar{X}_{B_k})]^2$ will equal the squared difference between the means
of the two levels of Factor B that are being compared (i.e., in the case of the comparison under
discussion, $[\Sigma (c_{B_k})(\bar{X}_{B_k})]^2 = (\bar{X}_{B_1} - \bar{X}_{B_3})^2$).

$$SS_{B_{comp}} = \frac{np[\Sigma (c_{B_k})(\bar{X}_{B_k})]^2}{\Sigma c_{B_k}^2} \qquad \text{(Equation 27.28)}$$

$$MS_{B_{comp}} = \frac{SS_{B_{comp}}}{df_{B_{comp}}} \qquad \text{(Equation 27.29)}$$

$$F_{B_{comp}} = \frac{MS_{B_{comp}}}{MS_{WG}} \qquad \text{(Equation 27.30)}$$

The data from Example 27.1 are now employed in Equations 27.28-27.30 to conduct the 
comparison $\bar{X}_{B_1}$ versus $\bar{X}_{B_3}$. Note that since Levels 1 and 3 of Factor B constitute the groups that 
are involved in the comparison, the coefficients for the comparison are $c_{B_1} = +1$, $c_{B_2} = 0$, 
$c_{B_3} = -1$. Thus, $\sum c_{B_k}^2 = 2$ and $[\sum(c_{B_k}\bar{X}_{B_k})]^2 = (\bar{X}_{B_1} - \bar{X}_{B_3})^2 = (6.5 - 2.5)^2 = 16$. 
Substituting the appropriate values in Equation 27.28, the value $SS_{B_{comp}} = 48$ is computed. 

\[ SS_{B_{comp}} = \frac{(3)(2)(16)}{2} = 48 \] 

Since all linear contrasts represent a single degree of freedom comparison, $df_{B_{comp}} = 1$. 
Employing Equations 27.29 and 27.30, the values $MS_{B_{comp}} = 48$ and $F_{B_{comp}} = 48$ are computed. 
Note that the value $MS_{WG} = 1$ computed for the omnibus F test is employed in the 
denominator of Equation 27.30. 

\[ MS_{B_{comp}} = \frac{48}{1} = 48 \] 

\[ F_{B_{comp}} = \frac{48}{1} = 48 \] 

The value $F_{B_{comp}} = 48$ is evaluated with Table A10. Employing the latter table, the 
appropriate degrees of freedom for the numerator and denominator are $df_{num} = df_{B_{comp}} = 1$ and 
$df_{den} = df_{WG} = 12$. For $df_{num} = 1$ and $df_{den} = 12$, the tabled critical .05 and .01 values are 
$F_{.05} = 4.75$ and $F_{.01} = 9.33$. Since the obtained value $F_{B_{comp}} = 48$ is greater than the aforementioned 
critical values, the nondirectional alternative hypothesis $H_1: \mu_{B_1} \neq \mu_{B_3}$ is supported 
at both the .05 and .01 levels. 
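
The arithmetic of Equations 27.28-27.30 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `linear_contrast` and its parameter names are assumptions introduced here, while the means, the sample size np = 6, and the value of the within-groups mean square are the Example 27.1 values quoted above.

```python
def linear_contrast(means, coefficients, n_per_mean, ms_wg, df_comp=1):
    """Return (SS_comp, MS_comp, F_comp) for a single degree of freedom contrast.

    n_per_mean is the number of subjects behind each mean
    (np for marginal means of Factor B, nq for Factor A).
    """
    weighted_sum = sum(c * m for c, m in zip(coefficients, means))
    sum_c_squared = sum(c ** 2 for c in coefficients)
    ss_comp = n_per_mean * weighted_sum ** 2 / sum_c_squared   # Equation 27.28
    ms_comp = ss_comp / df_comp                                # Equation 27.29
    f_comp = ms_comp / ms_wg                                   # Equation 27.30
    return ss_comp, ms_comp, f_comp


# Marginal means of Factor B in Example 27.1; Level 1 versus Level 3
b_means = [6.5, 5.5, 2.5]
ss, ms, f = linear_contrast(b_means, [1, 0, -1], n_per_mean=6, ms_wg=1.0)
print(ss, ms, f)  # 48.0 48.0 48.0
```

The same function applies to Factor A if `n_per_mean` is set to nq instead of np.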
Equations 27.31-27.33 are employed to evaluate comparisons involving the levels of 
Factor A. 

\[ SS_{A_{comp}} = \frac{nq\,[\sum(c_{A_j}\bar{X}_{A_j})]^2}{\sum c_{A_j}^2} \]  (Equation 27.31) 

\[ MS_{A_{comp}} = \frac{SS_{A_{comp}}}{df_{A_{comp}}} \]  (Equation 27.32) 

\[ F_{A_{comp}} = \frac{MS_{A_{comp}}}{MS_{WG}} \]  (Equation 27.33) 


Note that in Equation 27.31 nq represents the sample size, which in this case is the number 
of subjects who serve in each level of Factor A. The value $[\sum(c_{A_j}\bar{X}_{A_j})]^2$ is equal to 
$(\bar{X}_{A_x} - \bar{X}_{A_y})^2$, which is the squared difference between the two means involved in the comparison 
(where x and y represent the levels of Factor A that are employed in the comparison). 

As is the case with comparisons conducted for a single-factor between-subjects analysis 
of variance, a CD value can be computed for any comparison. Recollect that a CD value represents 
the minimum required difference in order for two means to differ significantly from one 
another. To demonstrate this, two CD values will be computed for the comparison $\bar{X}_{B_1}$ versus 
$\bar{X}_{B_3}$. Specifically, $CD_{LSD}$ and $CD_S$ will be computed. $CD_{LSD}$ (which is the CD value associated 
with the linear contrast that is conducted with Equations 27.28-27.30) is the lowest possible 
difference that can be computed with any of the available comparison procedures. $CD_S$ (the 
value for the Scheffé test), on the other hand, is the largest CD value among the methods 
that are available. If the obtained difference for a comparison is less than $CD_{LSD}$, the null 
hypothesis will be retained, whereas if it is larger than $CD_S$ it will be rejected. For the purpose 
of this discussion, it will be assumed that an obtained difference that is larger than $CD_{LSD}$ but 
less than $CD_S$ will be relegated to the suspend judgement category. 
Equation 27.34 is employed to compute the value $CD_{LSD} = 1.25$ for the simple comparison 
$\bar{X}_{B_1}$ versus $\bar{X}_{B_3}$, for $\alpha = .05$. In point of fact, the $CD_{LSD}$ value computed with Equation 
27.34 applies to all three simple comparisons that can be conducted with respect to Factor B (i.e., 
$\bar{X}_{B_1} - \bar{X}_{B_2} = 6.5 - 5.5 = 1$; $\bar{X}_{B_1} - \bar{X}_{B_3} = 6.5 - 2.5 = 4$; $\bar{X}_{B_2} - \bar{X}_{B_3} = 5.5 - 2.5 = 3$). 

\[ CD_{LSD} = \sqrt{F_{(1,WG)}}\sqrt{\frac{2MS_{WG}}{np}} = \sqrt{4.75}\sqrt{\frac{(2)(1)}{(3)(2)}} = 1.25 \]  (Equation 27.34) 


In order to differ significantly at the .05 level, the means of any two levels of Factor B must 
differ from one another by at least 1.25 units. Thus, the differences $\bar{X}_{B_1} - \bar{X}_{B_3} = 4$ and 
$\bar{X}_{B_2} - \bar{X}_{B_3} = 3$ are significant, while the difference $\bar{X}_{B_1} - \bar{X}_{B_2} = 1$ is not. 
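
As a rough check on Equation 27.34, the critical difference and the three pairwise decisions can be computed directly. The sketch below assumes the tabled value $F_{(1,WG)} = 4.75$ and the Example 27.1 values np = 6 and $MS_{WG} = 1$; the function and variable names are illustrative only. Carried out without intermediate rounding, the computation gives approximately 1.26 rather than the 1.25 printed in the text, a discrepancy that presumably reflects rounding of intermediate values and that does not alter any of the three decisions.

```python
import math


def cd_lsd(f_critical, ms_wg, n_per_mean):
    """Critical difference for a simple comparison (form of Equation 27.34)."""
    return math.sqrt(f_critical) * math.sqrt(2 * ms_wg / n_per_mean)


cd = cd_lsd(f_critical=4.75, ms_wg=1.0, n_per_mean=6)
print(round(cd, 2))  # 1.26

# Marginal means of Factor B from Example 27.1
b_means = {"B1": 6.5, "B2": 5.5, "B3": 2.5}
pairs = [("B1", "B2"), ("B1", "B3"), ("B2", "B3")]
significant = {
    pair: abs(b_means[pair[0]] - b_means[pair[1]]) >= cd for pair in pairs
}
print(significant)
```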

Note that Equation 27.34 is identical to Equation 21.24 employed for computing the $CD_{LSD}$ 
value for the single-factor between-subjects analysis of variance, except for the fact that in 
Equation 27.34, np subjects are employed per group/level of Factor B. In Equation 27.34, the 
value $F_{(1,WG)} = 4.75$ is the tabled critical .05 F value for $df_{num} = 1$ and $df_{den} = 12$, which represent the 
degrees of freedom associated with the $F_{B_{comp}}$ value computed with Equation 27.30. 


Equations 27.35 and 27.36, which are analogous to Equation 21.25 (which is the generic 
equation for both simple and complex comparisons for $CD_{LSD}$ for the single-factor between-subjects 
analysis of variance), are, respectively, the generic equations for Factors A and B for 
computing $CD_{LSD}$. 

\[ CD_{LSD} = \sqrt{F_{(1,WG)}}\sqrt{\frac{(\sum c_{A_j}^2)(MS_{WG})}{nq}} \]  (Equation 27.35) 

\[ CD_{LSD} = \sqrt{F_{(1,WG)}}\sqrt{\frac{(\sum c_{B_k}^2)(MS_{WG})}{np}} \]  (Equation 27.36) 


At this point the Scheffé test will be employed to conduct the simple comparison $\bar{X}_{B_1}$ 
versus $\bar{X}_{B_3}$. Equation 27.37, which is analogous to Equation 21.32 (which is the equation for 
simple comparisons for $CD_S$ for the single-factor between-subjects analysis of variance), is 
employed to compute the value $CD_S = 1.61$, with $\alpha_{FW} = .05$. The value $F_{(B,WG)} = 3.89$ used 
in Equation 27.37 is the tabled critical .05 F value employed in evaluating the main effect for 
Factor B in the omnibus F test. 

\[ CD_S = \sqrt{(q - 1)F_{(B,WG)}}\sqrt{\frac{2MS_{WG}}{np}} = \sqrt{(3 - 1)(3.89)}\sqrt{\frac{(2)(1)}{(3)(2)}} = 1.61 \]  (Equation 27.37) 


Thus, in order to differ significantly at the .05 level, the means of any two levels of Factor 
B must differ from one another by at least 1.61 units. As is the case when $CD_{LSD} = 1.25$ is 
computed, the differences $\bar{X}_{B_1} - \bar{X}_{B_3} = 4$ and $\bar{X}_{B_2} - \bar{X}_{B_3} = 3$ are significant, while the 
difference $\bar{X}_{B_1} - \bar{X}_{B_2} = 1$ is not. 

Equations 27.38 and 27.39, which are analogous to Equation 21.33 (which is the generic 
equation for both simple and complex comparisons for $CD_S$ for the single-factor between-subjects 
analysis of variance), are, respectively, the generic equations for Factors A and B for 
computing $CD_S$. Note that in conducting comparisons involving the levels of Factor A, the value 
$F_{(A,WG)}$ employed in Equation 27.38 is the tabled critical F value at the prespecified level of 
significance used in evaluating the main effect for Factor A in the omnibus F test. 

\[ CD_S = \sqrt{(p - 1)F_{(A,WG)}}\sqrt{\frac{(\sum c_{A_j}^2)(MS_{WG})}{nq}} \]  (Equation 27.38) 

\[ CD_S = \sqrt{(q - 1)F_{(B,WG)}}\sqrt{\frac{(\sum c_{B_k}^2)(MS_{WG})}{np}} \]  (Equation 27.39) 
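
The generic Scheffé equations can be sketched as a single function that takes the comparison coefficients as input. The function name `cd_scheffe` and its arguments are assumptions made for illustration; the numerical inputs are the tabled value $F_{(B,WG)} = 3.89$ and the Example 27.1 values. With the coefficients of a simple comparison (for which the sum of the squared coefficients equals 2) the function reduces to Equation 27.37.

```python
import math


def cd_scheffe(n_levels, f_critical, coefficients, ms_wg, n_per_mean):
    """Scheffe critical difference for a comparison among marginal means.

    n_levels is p for Factor A or q for Factor B; n_per_mean is nq or np.
    """
    sum_c_squared = sum(c ** 2 for c in coefficients)
    return (math.sqrt((n_levels - 1) * f_critical)
            * math.sqrt(sum_c_squared * ms_wg / n_per_mean))


# Simple comparison of Level 1 versus Level 3 of Factor B
cd_simple = cd_scheffe(n_levels=3, f_critical=3.89, coefficients=[1, 0, -1],
                       ms_wg=1.0, n_per_mean=6)
print(round(cd_simple, 2))  # 1.61
```

For a complex comparison only the coefficient list changes (e.g., `[1, -0.5, -0.5]` to contrast one level with the average of the other two).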


In closing the discussion of the Scheffé test, it should be noted that since Equation 27.37 
only takes into account those comparisons that are possible involving the levels of Factor B, it 
may not be viewed as imposing adequate control over $\alpha_{FW}$ if one intends to conduct additional 
comparisons involving the levels of Factor A and/or specific groups that are a combination of 
both factors. Because of this, some sources make a distinction between the familywise error 
rate ($\alpha_{FW}$) and the experimentwise error rate. Although in the case of a single factor 
experiment the two values will be identical, in a multifactor experiment, a familywise error rate 
can be computed for comparisons within each factor as well as for comparisons between groups 
that are based on combinations of the factors. The experimentwise error rate will be a 
composite error rate which will be the result of combining all of the familywise error rates. 
Thus, in the above example if one intends to conduct additional comparisons involving the levels 
of Factor A and/or groups that are combinations of both factors, one can argue that the Scheffé 
test as employed does not impose sufficient control over the value of the experimentwise error 
rate. Probably the simplest way to deal with such a situation is to conduct a more conservative 
test in evaluating any null hypotheses involving the levels of Factor A, Factor B, or groups that 
are combinations of both factors (i.e., evaluate a null hypothesis at the .01 level instead of at the 
.05 level). 


Evaluation of an omnibus hypothesis involving more than two marginal means If the 
interaction is not significant, it is conceivable that a researcher may wish to conduct an F test on 
three or more marginal means in a design where the factor involved has four or more levels. In 
other words, if in Example 27.1 there were four levels on Factor B instead of three, one might 
want to evaluate the null hypothesis $H_0: \mu_{B_1} = \mu_{B_2} = \mu_{B_3}$. The logic that is employed in conducting 
such an analysis for the single-factor between-subjects analysis of variance can be 
extended to a factorial design. Specifically, in the case of a 2 x 4 design, a between-subjects 
factorial analysis of variance employing all of the data is conducted initially. Upon determining 
that the interaction is not significant, a single-factor between-subjects analysis of variance 
can then be conducted employing only the data for the three levels of Factor B in which 
the researcher is interested (i.e., $B_1$, $B_2$, and $B_3$). The following F ratio is computed: 
$F_{(B_1/B_2/B_3)} = MS_{(B_1/B_2/B_3)}/MS_{WG}$. Note that the mean square value in the numerator is based on 
the between-groups variability in the single-factor between-subjects analysis of variance 
that involves only the data for levels $B_1$, $B_2$, and $B_3$ of Factor B. The degrees of freedom 
associated with the numerator of the F ratio is 2, since it is based on the number of levels 
of Factor B evaluated with the single-factor between-subjects analysis of variance (i.e., 
$df_{(B_1/B_2/B_3)} = 3 - 1 = 2$). The mean square and degrees of freedom for the denominator of the 
F ratio are the within-groups mean square and degrees of freedom computed for the between-subjects 
factorial analysis of variance when the full set of data is employed (i.e., the data for 
all four levels of Factor B). For further clarification of the aforementioned procedure the reader 
should review Section VI of the single-factor between-subjects analysis of variance. 
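
The procedure just described can be sketched as follows. Since Example 27.1 has only three levels of Factor B, all of the numbers below are hypothetical: the scores, the 2 x 4 layout with n = 2 (so np = 4 scores per level), and the within-groups mean square of 2.0 are invented purely to show the mechanics, and the function name `subset_f_ratio` is likewise an assumption.

```python
def subset_f_ratio(level_scores, ms_wg_full):
    """F ratio for a subset of the levels of one factor.

    The numerator comes from a single-factor analysis of only the chosen levels;
    the denominator is the within-groups mean square of the full factorial design.
    """
    all_scores = [x for scores in level_scores for x in scores]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_between = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2
                     for s in level_scores)
    df_between = len(level_scores) - 1
    ms_between = ss_between / df_between
    return ms_between, df_between, ms_between / ms_wg_full


# Hypothetical scores for levels B1, B2, and B3, pooled over both levels of Factor A
b1, b2, b3 = [6, 7, 8, 7], [5, 6, 5, 4], [2, 3, 4, 3]
ms_wg_full = 2.0  # hypothetical MS_WG from the analysis of all four levels of Factor B
ms_num, df_num, f_ratio = subset_f_ratio([b1, b2, b3], ms_wg_full)
print(ms_num, df_num, f_ratio)  # 16.0 2 8.0
```

In this hypothetical layout the denominator degrees of freedom would be those of the full design, pq(n - 1) = (2)(4)(2 - 1) = 8, not those of the three-level subset.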


Comparisons between specific groups that are a combination of both factors The procedures 
employed for comparing the marginal means can also be employed to evaluate 
differences between specific groups that are a combination of both factors (e.g., a comparison 
such as $\bar{X}_{AB_{11}}$ versus $\bar{X}_{AB_{12}}$). Such differences are most likely to be of interest when an 
interaction is present. It should be noted that these are not the only types of comparisons that can 
provide more specific information regarding the nature of an interaction. A more comprehensive 
discussion of further analysis of an interaction can be found in books that specialize in the 
analysis of variance. Keppel (1991), among others, provides an excellent discussion of this 
general subject. 

In comparing specific groups with one another, the same equations are essentially employed 
that are used for the comparison of marginal means, except for the fact that the equations must 
be modified in order to accommodate the sample size of the groups. Both simple and complex 
comparisons can be conducted. As an example, let us assume we want to conduct a linear 
contrast for the simple comparison $\bar{X}_{AB_{11}}$ versus $\bar{X}_{AB_{12}}$. Equation 27.40 is employed for 
conducting such a comparison. Note that the latter equation has the same basic structure as 
Equations 27.28 and 27.31, but is based on the sample size of n, which is the sample size of each 
of the p x q groups. 


\[ SS_{comp} = \frac{n\,[\sum(c_{AB_{jk}}\bar{X}_{AB_{jk}})]^2}{\sum c_{AB_{jk}}^2} \]  (Equation 27.40) 


In Equation 27.40 the value $[\sum(c_{AB_{jk}}\bar{X}_{AB_{jk}})]^2$ is equal to the squared difference between 
the means of the two groups that are being compared (i.e., for the comparison under discussion 
it yields the same value as $(\bar{X}_{AB_{11}} - \bar{X}_{AB_{12}})^2$). Note that since only two of the p x q = 6 groups 
are involved in the comparison, the coefficients for the comparison are $c_{AB_{11}} = +1$, $c_{AB_{12}} = -1$, 
and $c_{AB_{jk}} = 0$ for the remaining four groups. Thus, $[\sum(c_{AB_{jk}}\bar{X}_{AB_{jk}})]^2 = (\bar{X}_{AB_{11}} - \bar{X}_{AB_{12}})^2$ 
$= (10 - 7)^2 = (3)^2 = 9$ and $\sum c_{AB_{jk}}^2 = 2$. Substituting the appropriate values in Equation 27.40, 
the value $SS_{comp} = 13.5$ is computed. 

\[ SS_{comp} = \frac{(3)(9)}{2} = 13.5 \] 

Employing Equations 21.18 and 21.19, the values $MS_{comp} = 13.5$ and $F_{comp} = 13.5$ are 
computed. 

\[ MS_{comp} = \frac{SS_{comp}}{df_{comp}} = \frac{13.5}{1} = 13.5 \] 

\[ F_{comp} = \frac{MS_{comp}}{MS_{WG}} = \frac{13.5}{1} = 13.5 \] 

Employing Table A10, the appropriate degrees of freedom for the numerator and 
denominator are $df_{num} = df_{comp} = 1$ (since the comparison is a single degree of freedom 
comparison) and $df_{den} = df_{WG} = 12$. For $df_{num} = 1$ and $df_{den} = 12$, the tabled critical .05 
and .01 values are $F_{.05} = 4.75$ and $F_{.01} = 9.33$. Since the obtained value $F_{comp} = 13.5$ is 
greater than the aforementioned critical values, the nondirectional alternative hypothesis 
$H_1: \mu_{AB_{11}} \neq \mu_{AB_{12}}$ is supported at both the .05 and .01 levels. 
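
A contrast between two specific groups can be sketched in the same way as a contrast between marginal means, with n in place of np or nq. In the sketch below the function name and the dictionary layout are assumptions. The means of Groups $AB_{11}$ and $AB_{12}$ (10 and 7) are quoted in the text; the remaining four means are derived from the group sums and marginal means of Example 27.1 and carry coefficients of zero, so they do not affect the result.

```python
def cell_contrast(cell_means, coefficients, n, ms_wg):
    """Return (SS_comp, F_comp) for a single degree of freedom contrast between groups."""
    weighted_sum = sum(coefficients[g] * cell_means[g] for g in cell_means)
    sum_c_squared = sum(c ** 2 for c in coefficients.values())
    ss_comp = n * weighted_sum ** 2 / sum_c_squared   # Equation 27.40
    f_comp = (ss_comp / 1) / ms_wg                    # df_comp = 1, so MS_comp = SS_comp
    return ss_comp, f_comp


cell_means = {"AB11": 10, "AB12": 7, "AB13": 4,   # Level 1 of Factor A
              "AB21": 3, "AB22": 4, "AB23": 1}    # Level 2 of Factor A (derived values)
coefficients = {"AB11": 1, "AB12": -1, "AB13": 0,
                "AB21": 0, "AB22": 0, "AB23": 0}

ss_comp, f_comp = cell_contrast(cell_means, coefficients, n=3, ms_wg=1.0)
print(ss_comp, f_comp)  # 13.5 13.5
```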

$CD_{LSD}$ and $CD_S$ values will now be computed for the above comparison, for $\alpha = .05$. 
$CD_{LSD}$ is computed with Equation 27.41 (which is identical to Equation 21.24 employed to 
compute $CD_{LSD}$ for the single-factor between-subjects analysis of variance). Note that the 
sample size employed in Equation 27.41 is $n = n_{AB_{jk}} = 3$. Substituting the appropriate values 
in Equation 27.41, the value $CD_{LSD} = 1.78$ is computed. 

\[ CD_{LSD} = \sqrt{F_{(1,WG)}}\sqrt{\frac{2MS_{WG}}{n}} = \sqrt{4.75}\sqrt{\frac{(2)(1)}{3}} = 1.78 \]  (Equation 27.41) 


Since in order to differ significantly at the .05 level the means of any two groups must differ 
from one another by at least 1.78 units, the difference $\bar{X}_{AB_{11}} - \bar{X}_{AB_{12}} = 3$ is significant. If we 
conduct comparisons for all 15 possible differences between pairs of groups (i.e., all simple 
comparisons), any difference that is equal to or greater than 1.78 units is significant at the .05 
level. Recollect, though, that since the computation of a $CD_{LSD}$ value does not control the 
value of $\alpha_{FW}$, the per comparison Type I error rate will equal .05. 

Equation 27.42 (which is analogous to Equation 21.25 employed for the single-factor 
between-subjects analysis of variance) is the generic form of Equation 27.41 that can be 
employed for both simple and complex comparisons. 

\[ CD_{LSD} = \sqrt{F_{(1,WG)}}\sqrt{\frac{(\sum c_{AB_{jk}}^2)(MS_{WG})}{n}} \]  (Equation 27.42) 


$CD_S$ is computed with Equation 27.43 (which is analogous to Equation 21.32 employed 
to compute $CD_S$ for the single-factor between-subjects analysis of variance). Substituting the 
appropriate values in Equation 27.43, the value $CD_S = 3.22$ is computed. The value $F_{(BG,WG)}$ 
$= 3.11$ used in Equation 27.43 is the tabled critical .05 F value for $df_{num} = df_{BG} = pq - 1$ 
$= (2)(3) - 1 = 5$ and $df_{den} = df_{WG} = 12$ employed in the omnibus F test. 

\[ CD_S = \sqrt{(pq - 1)F_{(BG,WG)}}\sqrt{\frac{2MS_{WG}}{n}} = \sqrt{[(2)(3) - 1](3.11)}\sqrt{\frac{(2)(1)}{3}} = 3.22 \]  (Equation 27.43) 


Thus, in order for any pair of means to differ significantly, the difference between the two 
means must be equal to or greater than 3.22 units. Since the difference $\bar{X}_{AB_{11}} - \bar{X}_{AB_{12}} = 3$ is 
less than $CD_S = 3.22$, the null hypothesis cannot be rejected if the Scheffé test is employed. 
Thus, the nondirectional alternative hypothesis $H_1: \mu_{AB_{11}} \neq \mu_{AB_{12}}$ is not supported. 


Equation 27.44 (which is analogous to Equation 21.33 employed for the single-factor 
between-subjects analysis of variance) is the generic form of Equation 27.43 that can be 
employed for both simple and complex comparisons. 

\[ CD_S = \sqrt{(pq - 1)F_{(BG,WG)}}\sqrt{\frac{(\sum c_{AB_{jk}}^2)(MS_{WG})}{n}} \]  (Equation 27.44) 


Since the linear contrast procedure yields a significant difference and the Scheffé test does 
not, one might want to suspend judgement with respect to the comparison $\bar{X}_{AB_{11}}$ versus 
$\bar{X}_{AB_{12}}$ until a replication study is conducted. However, it is certainly conceivable that many 
researchers might consider the Scheffé test to be too conservative a procedure. Thus, one might 
elect to use a less conservative procedure such as Tukey's HSD test. Equation 27.45 (which is 
analogous to Equation 21.31 employed to compute $CD_{HSD}$ for the single-factor between-subjects 
analysis of variance) is employed to compute $CD_{HSD}$. The value $q_{(pq, df_{WG})} = 4.75$ 
in Equation 27.45 is the value of the Studentized range statistic in Table A13 (Table of the 
Studentized Range Statistic) in the Appendix for k = pq = 6 and $df_{WG} = 12$. 

\[ CD_{HSD} = q_{(pq, df_{WG})}\sqrt{\frac{MS_{WG}}{n}} = 4.75\sqrt{\frac{1}{3}} = 2.74 \]  (Equation 27.45) 

Since $CD_{HSD} = 2.74$ is less than $\bar{X}_{AB_{11}} - \bar{X}_{AB_{12}} = 3$, we can conclude that the difference 
between the groups is significant. 
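
The three critical differences discussed for the comparison of Groups $AB_{11}$ and $AB_{12}$ can be placed side by side in a short sketch. The tabled values (4.75, 3.11, and 4.75) and the Example 27.1 values are those quoted in the text; the variable names and the dictionary are illustrative assumptions.

```python
import math

ms_wg, n, p, q = 1.0, 3, 2, 3
f_1_wg = 4.75    # tabled critical .05 F value for df = 1, 12
f_bg_wg = 3.11   # tabled critical .05 F value for df = 5, 12
q_crit = 4.75    # tabled Studentized range value for k = 6, df = 12

critical_differences = {
    "LSD": math.sqrt(f_1_wg) * math.sqrt(2 * ms_wg / n),                       # Eq. 27.41
    "Scheffe": math.sqrt((p * q - 1) * f_bg_wg) * math.sqrt(2 * ms_wg / n),    # Eq. 27.43
    "HSD": q_crit * math.sqrt(ms_wg / n),                                      # Eq. 27.45
}

observed_difference = 10 - 7  # mean of Group AB11 minus mean of Group AB12
for name, cd in critical_differences.items():
    print(name, round(cd, 2), observed_difference >= cd)
# LSD 1.78 True
# Scheffe 3.22 False
# HSD 2.74 True
```

The output reproduces the ordering described above: the linear contrast is the least conservative procedure, the Scheffé test the most conservative, and Tukey's HSD test falls between the two.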


The computation of a confidence interval for a comparison The same procedure described 
for computing a confidence interval for a comparison for the single-factor between-subjects 
analysis of variance can also be employed for the between-subjects factorial analysis of 
variance. Specifically, the following procedure is employed for computing a confidence interval 
for any of the methods described in this section: The obtained CD value is added to and 
subtracted from the obtained difference between the two means (or sets of means in the case of 
a complex comparison). The resulting range of values defines the confidence interval. The 95% 
confidence interval will be associated with a computed $CD_{.05}$ value, and the 99% confidence 
interval will be associated with a computed $CD_{.01}$ value. To illustrate the computation of a confidence 
interval, the 95% confidence interval for the value $CD_{HSD} = 2.74$ computed for the 
comparison $\bar{X}_{AB_{11}}$ versus $\bar{X}_{AB_{12}}$ is demonstrated below. 

\[ CI_{.95} = (\bar{X}_{AB_{11}} - \bar{X}_{AB_{12}}) \pm CD_{HSD} = 3 \pm 2.74 \] 

Thus, the researcher can be 95% sure (or the probability is .95) that the mean of the 
population represented by Group $AB_{11}$ is between .26 and 5.74 units larger than the mean of the 
population represented by Group $AB_{12}$. This result can be stated symbolically as follows: 

\[ .26 \leq (\mu_{AB_{11}} - \mu_{AB_{12}}) \leq 5.74 \] 
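
The confidence interval procedure amounts to one addition and one subtraction, as the following minimal sketch shows. The function name is an assumption; the inputs are the obtained difference of 3 and the value $CD_{HSD} = 2.74$ from the text.

```python
def confidence_interval(mean_difference, cd):
    """Return the (lower, upper) limits: obtained difference minus and plus the CD value."""
    return mean_difference - cd, mean_difference + cd


lower, upper = confidence_interval(mean_difference=10 - 7, cd=2.74)
print(round(lower, 2), round(upper, 2))  # 0.26 5.74
```

Substituting a CD value computed with a tabled .01 critical value would yield the 99% confidence interval instead.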


Analysis of simple effects Earlier in this section it was noted that the most logical strategy to 
employ when a significant interaction is detected is to initially test for what is referred to as 
simple effects. A test of a simple effect is essentially an analysis of variance evaluating all of 
the levels of one factor across only one level of the other factor. The analysis of simple effects 
will be illustrated with Example 27.1 by evaluating the simple effects of Factor B. Specifically, an 
F test will be conducted to evaluate the scores of subjects on the three levels of Factor B, but 
only for the nq subjects who served under Level 1 of Factor A (i.e., an F ratio will be computed 
evaluating the Groups $AB_{11}$, $AB_{12}$, and $AB_{13}$). This represents the analysis of the simple effect 
of Factor B at level $A_1$. An analysis of a second simple effect (which represents the analysis 
of the simple effect of Factor B at level $A_2$) will evaluate the scores of subjects on the three 
levels of Factor B, but only for the nq subjects who served under Level 2 of Factor A (i.e., Groups 
$AB_{21}$, $AB_{22}$, and $AB_{23}$). 

Although it will not be done in reference to Example 27.1, since an interaction is present 
a comprehensive analysis of the data would also involve evaluating the simple effects of Factor 
A. There are three simple effects of Factor A that can be evaluated, each one involving 
comparing the scores of subjects on Factor A, but employing only the np subjects who served 
under one of the three levels of Factor B. The three simple effects of Factor A involve the 
following contrasts: 1) The simple effect of Factor A at Level $B_1$: $\bar{X}_{AB_{11}}$ versus $\bar{X}_{AB_{21}}$; 2) The 
simple effect of Factor A at Level $B_2$: $\bar{X}_{AB_{12}}$ versus $\bar{X}_{AB_{22}}$; and 3) The simple effect of Factor A 
at Level $B_3$: $\bar{X}_{AB_{13}}$ versus $\bar{X}_{AB_{23}}$. 

In order to evaluate a simple effect, it is necessary to initially compute a sum of squares for 
the specific effect. Thus, in evaluating the simple effects of Factor B it is necessary to compute 
a sum of squares for Factor B at Level 1 of Factor A ($SS_{B \text{ at } A_1}$) and a sum of squares for Factor 
B at Level 2 of Factor A ($SS_{B \text{ at } A_2}$). Upon computing all of the sums of squares for the simple 
effects for a specific factor, F ratios are computed for each of the simple effects by dividing the 
mean square for a simple effect (which is obtained by dividing the simple effect sum of squares 
by its degrees of freedom) by the within-groups mean square derived for the factorial analysis 
of variance. This procedure will now be demonstrated for the simple effects of Factor B. 

Equation 27.46 is employed to compute the sum of squares for each of the simple effects. 
If $\sum X_{AB_{jk}}$ represents the sum of the scores on Level j of Factor A of subjects who serve under 
a specific level of Factor B, the notation $\sum[(\sum X_{AB_{jk}})^2/n]$ in Equation 27.46 indicates that the sum 
of the scores for each level of Factor B at a given level of Factor A is squared, divided by n, 
and the q resulting values are summed. The notation $(\sum X_{A_j})^2$ represents the square of the sum 
of scores of the nq subjects who serve under the specified level of Factor A. 

\[ SS_{B \text{ at } A_j} = \sum_{k=1}^{q}\left[\frac{(\sum X_{AB_{jk}})^2}{n}\right] - \frac{(\sum X_{A_j})^2}{nq} \]  (Equation 27.46) 

\[ SS_{B \text{ at } A_1} = \frac{(30)^2 + (21)^2 + (12)^2}{3} - \frac{(63)^2}{(3)(3)} = 54 \] 

\[ SS_{B \text{ at } A_2} = \frac{(9)^2 + (12)^2 + (3)^2}{3} - \frac{(24)^2}{(3)(3)} = 14 \] 


Table 27.6 summarizes the analysis of variance for the simple effects of Factor B. Note 
that for each of the simple effects, the degrees of freedom for the effect is $df_{B \text{ at } A_j} = q - 1$ 
$= 3 - 1 = 2$ (which equals $df_B$ employed for the between-subjects factorial analysis of 
variance). The mean square for each simple effect is obtained by dividing the sum of squares 
for the simple effect by its degrees of freedom. The F value for each simple effect is obtained 
by dividing the mean square for the simple effect by $MS_{WG} = 1$ computed for the factorial 
analysis of variance. Thus, $F_{B \text{ at } A_1} = 27/1 = 27$ and $F_{B \text{ at } A_2} = 7/1 = 7$. 

Table 27.6 Analysis of Simple Effects of Factor B 

Source of variation     SS     df     MS     F 
B at A1                 54      2     27     27 
B at A2                 14      2      7      7 
Within-groups           12     12      1 

Employing Table A10, the degrees of freedom used in evaluating each of the simple effects 
are $df_{num} = df_{B \text{ at } A_j} = 2$ and $df_{den} = df_{WG} = 12$. Since both of the obtained values $F_{B \text{ at } A_1} = 27$ 
and $F_{B \text{ at } A_2} = 7$ are greater than $F_{.05} = 3.89$ and $F_{.01} = 6.93$ (which are the tabled critical 
values for $df_{num} = 2$ and $df_{den} = 12$), each of the simple effects is significant at both the .05 
and .01 levels. On the basis of the result of the analysis of the simple effects of Factor B, we can 
conclude that within each level of Factor A there is at least one simple or complex comparison 
involving the levels of Factor B that is significant. 

As noted earlier, when one or more of the simple effects is significant, additional simple 
and complex comparisons contrasting specific groups can be conducted. Thus, for Level 1 of 
Factor A, simple comparisons between $\bar{X}_{AB_{11}}$, $\bar{X}_{AB_{12}}$, and $\bar{X}_{AB_{13}}$, as well as complex comparisons 
(such as $\bar{X}_{AB_{11}}$ versus $(\bar{X}_{AB_{12}} + \bar{X}_{AB_{13}})/2$) can clarify the locus of the significant simple effect. 

If the homogeneity of variance assumption of the between-subjects factorial analysis of 
variance (which is discussed in the next section) is violated, in computing the F ratios for the 
simple effects, a researcher can justify employing an $MS_{WG}$ value that is just based on the groups 
involved in analyzing a specific simple effect, instead of the value of $MS_{WG}$ computed for the 
factorial analysis of variance. If the latter is done, the within-groups degrees of freedom employed 
in the analysis of the simple effects of Factor B becomes $df_{WG} = q(n - 1)$ instead of 
$df_{WG} = pq(n - 1)$. Since the within-groups degrees of freedom is smaller if $df_{WG} = q(n - 1)$ 
is employed, the test will be less powerful than a test employing $df_{WG} = pq(n - 1)$. The loss 
of power can be offset, however, if the new value for $MS_{WG}$ is lower than the value derived for 
the omnibus F test. 

The reader should take note of the fact that the variability within each of the simple effects 
is the result of contributions from both the main effect on the factor for which the simple effect 
is being evaluated (Factor B in our example), as well as any interaction between the two factors. 
For this reason, the total of the sum of squares for each of the simple effects for a given factor 
will be equal to the interaction sum of squares ($SS_{AB}$) plus the sum of squares for that factor 
($SS_B$). This can be confirmed by the fact that in our example the following is true: 

\[ [(SS_{B \text{ at } A_1} = 54) + (SS_{B \text{ at } A_2} = 14)] = [(SS_{AB} = 16) + (SS_B = 52)] = 68 \] 
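
The identity above can be checked numerically from the six group sums. The following sketch uses the standard computational formulas for a between-subjects factorial design; the variable names are assumptions, and the sums for Level 2 of Factor A (9, 12, 3) are derived from the marginal means of Example 27.1 rather than quoted from the text.

```python
n, p, q = 3, 2, 3
cell_sums = [[30, 21, 12],   # Level 1 of Factor A
             [9, 12, 3]]     # Level 2 of Factor A (derived values)

total = sum(sum(row) for row in cell_sums)
correction = total ** 2 / (n * p * q)

ss_bg = sum(s ** 2 for row in cell_sums for s in row) / n - correction
ss_a = sum(sum(row) ** 2 for row in cell_sums) / (n * q) - correction
ss_b = sum(sum(row[k] for row in cell_sums) ** 2 for k in range(q)) / (n * p) - correction
ss_ab = ss_bg - ss_a - ss_b

# Simple effects of Factor B at each level of Factor A (Equation 27.46)
ss_simple = [sum(s ** 2 for s in row) / n - sum(row) ** 2 / (n * q) for row in cell_sums]

print(ss_ab, ss_b)                      # 16.0 52.0
print(sum(ss_simple), ss_ab + ss_b)     # 68.0 68.0
```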


It should be noted that analysis of simple effects in and of itself cannot provide definitive 
evidence with regard to the presence or absence of an interaction. In point of fact, it is possible 
for only one of two simple effects to be significant, and yet the value of $F_{AB}$ computed for the 
factorial analysis of variance may not be significant. For a full clarification of this issue the 
reader should consult Keppel (1991). 


2. Evaluation of the homogeneity of variance assumption of the between-subjects factorial 
analysis of variance The homogeneity of variance assumption discussed in reference to the 
single-factor between-subjects analysis of variance is also an assumption of the between-subjects 
factorial analysis of variance. Since both tests employ the same protocol in evaluating 
this assumption, prior to reading this section the reader should review the relevant material for 
evaluating the homogeneity of variance assumption (through use of Hartley's $F_{max}$ test (Test 
11a)) in Section VI of the single-factor between-subjects analysis of variance (as well as the 
material on Hartley's $F_{max}$ test in Section VI of the t test for two independent samples). 

In the case of the between-subjects factorial analysis of variance, evaluation of the 
homogeneity of variance assumption requires the researcher to compute the estimated population 
variances for each of the pq groups. The latter values are computed with Equation I.5. As it turns 
out, the value of the estimated population variance for all six groups equals $\tilde{s}^2_{AB_{jk}} = 1$. This is 
demonstrated below for Group $AB_{11}$. 

\[ \tilde{s}^2_{AB_{11}} = \frac{\sum X^2_{AB_{11}} - \frac{(\sum X_{AB_{11}})^2}{n_{AB_{11}}}}{n_{AB_{11}} - 1} = \frac{302 - \frac{(30)^2}{3}}{3 - 1} = 1 \] 


Upon determining that the value of both the largest and smallest of the estimated population 
variances equals 1, Equation 21.37 is employed to compute the value of the $F_{max}$ statistic. 
Employing Equation 21.37, the value $F_{max} = 1$ is computed. 

\[ F_{max} = \frac{\tilde{s}^2_L}{\tilde{s}^2_S} = \frac{1}{1} = 1 \] 


In order to reject the null hypothesis ($H_0: \sigma^2_L = \sigma^2_S$) and thus conclude that the homogeneity 
of variance assumption is violated, the obtained $F_{max}$ value must be equal to or greater than 
the tabled critical value at the prespecified level of significance. Employing Table A9 (Table 
of the $F_{max}$ Distribution) in the Appendix, we determine that the tabled critical $F_{max}$ values 
for $n = n_{AB_{jk}} = 3$ and k = pq = 6 groups are $F_{max_{.05}} = 266$ and $F_{max_{.01}} = 1362$. Since the 
obtained value $F_{max} = 1$ is less than $F_{max_{.05}} = 266$, the null hypothesis cannot be rejected. In 
other words, the alternative hypothesis indicating the presence of heterogeneity of variance is not 
supported. The latter should be obvious without the use of the $F_{max}$ test, since the same value 
is computed for the variance of each of the groups. 
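
The evaluation of the homogeneity of variance assumption can be sketched as follows. The function name is an assumption, and the sum of squared scores for Group $AB_{11}$ (302) is the value implied by a sum of scores of 30, n = 3, and an estimated variance of 1; the variances of the other five groups are entered directly as 1, as reported in the text.

```python
def variance_from_sums(sum_x, sum_x_squared, n):
    """Unbiased estimate of the population variance from the sum and sum of squares."""
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)


var_ab11 = variance_from_sums(sum_x=30, sum_x_squared=302, n=3)

# The text reports an estimated variance of 1 for all six groups
group_variances = [var_ab11] + [1.0] * 5

f_max = max(group_variances) / min(group_variances)
critical_05 = 266  # tabled critical .05 value for n = 3 and k = 6 groups
print(var_ab11, f_max, f_max >= critical_05)  # 1.0 1.0 False
```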

In instances where the homogeneity of variance assumption is violated, the researcher 
should employ one of the strategies recommended for heterogeneity of variance that are 
discussed in Section VI of the single-factor between-subjects analysis of variance. The 
simplest strategy is to use a more conservative test (i.e., employ a lower $\alpha$ level) in evaluating 
the three sets of hypotheses for the factorial analysis of variance. 


3. Computation of the power of the between-subjects factorial analysis of variance Prior 
to reading this section the reader should review the procedure described for computing the power 
of the single-factor between-subjects analysis of variance, since the latter procedure can 
be generalized to the between-subjects factorial analysis of variance. In determining the 
appropriate sample size for a factorial design, a researcher must consider the predicted effect size 
for each of the factors, as well as the magnitude of any predicted interactions. Thus, in the case 
of Example 27.1, prior to the experiment, a separate power analysis can be conducted with 
respect to the main effect for Factor A, the main effect for Factor B, and the interaction between 
the two factors. The sample size the researcher should employ will be the largest of the sample 
sizes derived from analyzing the predicted effects associated with the two factors and the 
interaction. As is the case for the single-factor between-subjects analysis of variance, such 
an analysis will require the researcher to estimate the means of all of the experimental groups, 
as well as the value of error/within-groups variability (i.e., $\sigma^2_{WG}$). 

Equation 27.47, which contains the same basic elements that comprise Equation 21.38, is 
the general equation that is employed for determining the minimum sample size necessary in 
order to achieve a specified power with regard to either of the main effects or the interaction. 

\[ \phi = \sqrt{\frac{(\text{number of observations})\sum d^2}{(df_{effect} + 1)\,\sigma^2_{WG}}} \]  (Equation 27.47) 








The following should be noted with respect to Equation 27.47: 

a) The value employed for the number of observations will equal nq for Factor A, np for 
Factor B, and n for the interaction. 

b) $\sum d^2$ represents the sum of the squared deviation scores. This value is obtained as 
follows: 1) For Factor A, p deviation scores are computed by subtracting the estimated 
grand mean ($\mu_G$) from each of the estimated means of the levels of Factor A (i.e., 
$d_{A_j} = \mu_{A_j} - \mu_G$). $\sum d^2$, the sum of the squared deviation scores, is obtained by squaring the 
p deviation scores and summing the resulting values; 2) For Factor B, q deviation scores are 
computed by subtracting the estimated grand mean from each of the estimated means of the 
levels of Factor B (i.e., $d_{B_k} = \mu_{B_k} - \mu_G$). The sum of the squared deviation scores is obtained 
by squaring the q deviation scores and summing the resulting values; and 3) For the interaction, 
pq deviation scores are computed, one for each of the groups. A deviation score is computed 
for each group by employing the following equation: $d_{AB_{jk}} = \mu_{AB_{jk}} - \mu_{A_j} - \mu_{B_k} + \mu_G$. The 
latter equation indicates the following: The mean of the group is estimated ($\mu_{AB_{jk}}$), after which 
both the estimated mean of the level of Factor A the group serves under ($\mu_{A_j}$) and the estimated 
mean of the level of Factor B the group serves under ($\mu_{B_k}$) are subtracted from the estimated 
mean of the group. The estimated grand mean ($\mu_G$) is then added to this result. The resulting 
value represents the deviation score for that group. Upon computing a deviation score for each 
of the pq groups, the pq deviation scores are squared, after which the resulting squared deviation 
scores are summed. The resulting value equals $\sum d^2$. 

c) $(df_{effect} + 1)$ for Factor A equals $df_A + 1 = p$. $(df_{effect} + 1)$ for Factor B equals 
$df_B + 1 = q$. $(df_{effect} + 1)$ for the interaction equals $df_{AB} + 1 = (p - 1)(q - 1) + 1$. 

d) $\sigma^2_{WG}$ is the estimate of the population variance for any one of the pq groups (which are 
assumed to have equal variances if the homogeneity of variance assumption is true). If a power 
analysis is conducted after the data collection phase of a study, it is logical to employ $MS_{WG}$ as 
the estimate of $\sigma^2_{WG}$. 
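
The three sets of deviation scores described in note b) can be sketched as follows. The estimated group means used here are an illustrative assumption: the observed group means of Example 27.1 are borrowed as stand-ins for the estimates a researcher would supply prior to an experiment, and the variable names are likewise not from the text.

```python
p, q = 2, 3
# Assumed estimates of the pq group means (rows: levels of Factor A; columns: levels of Factor B)
cell_means = [[10, 7, 4],
              [3, 4, 1]]

a_means = [sum(row) / q for row in cell_means]
b_means = [sum(cell_means[j][k] for j in range(p)) / p for k in range(q)]
grand_mean = sum(a_means) / p

# Sum of squared deviation scores for each main effect
sum_d2_a = sum((m - grand_mean) ** 2 for m in a_means)
sum_d2_b = sum((m - grand_mean) ** 2 for m in b_means)

# Interaction: group mean minus both marginal means plus the grand mean
sum_d2_ab = sum(
    (cell_means[j][k] - a_means[j] - b_means[k] + grand_mean) ** 2
    for j in range(p) for k in range(q)
)

print(round(sum_d2_a, 2), round(sum_d2_b, 2), round(sum_d2_ab, 2))  # 9.39 8.67 5.33
```

Each of the three values would then be entered in Equation 27.47 together with the number of observations (nq, np, or n) and the value of $(df_{effect} + 1)$ appropriate to that effect.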

To illustrate the use of Equation 27.47, the power of detecting the main effect on Factor 
B will be computed. Let us assume that based on previous research, prior to evaluating the 
data we estimate the following values: $\mu_{B_1} = 7$, $\mu_{B_2} = 5$, $\mu_{B_3} = 3$. Since we know that 
$\mu_G = (\mu_{B_1} + \mu_{B_2} + \mu_{B_3})/q$, $\mu_G = (7 + 5 + 3)/3 = 5$. It will also be assumed that the 
estimated value for error variability is $\sigma^2_{WG} = 1.5$. The relevant values are now substituted in 
Equation 27.47. 

$$\phi = \sqrt{\frac{np\,\Sigma(\mu_{B_k} - \mu_G)^2}{q\,\sigma^2_{WG}}} = \sqrt{\frac{n(2)[(7 - 5)^2 + (5 - 5)^2 + (3 - 5)^2]}{(3)(1.5)}} = 1.89\sqrt{n}$$

If we employ n = 3 subjects per group (as is the case in Example 27.1), the value 
$\phi = 3.27$ is computed: $\phi = 1.89\sqrt{3} = 3.27$. Employing Table A15 (Graphs of the Power 
Function for the Analysis of Variance) in the Appendix, we use the set of power curves for 
$df_{num} = df_{effect} = df_B = 2$, and within that set employ the curve for $df_{den} = df_{WG} = 12$, for 
$\alpha = .05$. Since a perpendicular line erected from the value $\phi = 3.27$ on the abscissa to the curve 
for $df_{WG} = 12$ is beyond the highest point on the curve, the power of the test for the estimated 
effect on Factor B will be 1 if n = 3 subjects are employed per group. Thus, there is a 100% 
likelihood that an effect equal to or larger than the one stipulated by the values employed in 
Equation 27.47 will be detected. 
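
The arithmetic for $\phi$ in the Factor B illustration can be checked with a short script. The following is only a sketch: the function name `phi_factor_b` and its parameter names are illustrative assumptions rather than notation from the text, and the formula coded is the substituted form of Equation 27.47 shown for Factor B.

```python
import math

def phi_factor_b(level_means, n, p, error_variance):
    """phi for the main effect of Factor B (hypothetical helper, equal n per group)."""
    q = len(level_means)                      # number of levels of Factor B
    grand_mean = sum(level_means) / q         # estimated grand mean
    sum_sq_dev = sum((m - grand_mean) ** 2 for m in level_means)
    # n * p subjects serve under each level of Factor B
    return math.sqrt(n * p * sum_sq_dev / (q * error_variance))

phi = phi_factor_b([7, 5, 3], n=3, p=2, error_variance=1.5)
print(round(phi, 2))  # 3.27
```

Calling the function with n = 1 returns the multiplier 1.89, so trying other sample sizes only requires changing `n`. The power itself must still be read from the curves in Table A15.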

Although it will not be demonstrated here, to conduct a thorough power analysis it is 
necessary to also determine the minimum sample sizes required to achieve what a 
researcher would consider to be the minimum acceptable power for identifying the estimated 
effects for Factor A and the interaction. The largest of the values computed for n for each of the 
three power analyses is the sample size that should be employed for each of the pq groups in the 
study. For a more comprehensive discussion on computing the power of the between-subjects 
factorial analysis of variance the reader should consult Cohen (1977, 1988). 


4. Measures of magnitude of treatment effect for the between-subjects factorial analysis 
of variance: Omega squared (Test 27g) and Cohen's f index (Test 27h) Prior to reading this 
section the reader should review the discussion of magnitude of treatment effect in Section VI 
of both the t test for two independent samples and the single-factor between-subjects analysis 
of variance. The discussion for the latter test notes that the computation of an omnibus F 
value only provides a researcher with information regarding whether the null hypothesis can be 
rejected — i.e., whether a significant difference exists between at least two of the experimental 
treatments within a given factor. An F value (as well as the level of significance with which it 
is associated), however, does not provide the researcher with any information regarding the size 
of any treatment effect that is present. As is noted in earlier discussions of treatment effect, the 
latter is defined as the proportion of the variability on the dependent variable that is associated 
with the independent variable/experimental treatments. The measures described in this section 
are variously referred to as measures of effect size, measures of magnitude of treatment 
effect, measures of association, and correlation coefficients. 


Omega squared (Test 27g) The omega squared statistic is a commonly computed measure 
of treatment effect for the between-subjects factorial analysis of variance. Keppel (1991) and 
Kirk (1995) note that there is disagreement with respect to which variance components should 
be employed in computing omega squared for a factorial design. One method of computing 
omega squared (which computes a value referred to as standard omega squared) was 
employed in the previous edition of this book. The latter method expresses treatment variability 
for each of the factors as a proportion of the sum of all the elements that account for variability 
in a between-subjects factorial design (i.e., the variability for a given factor is divided by the sum 
of variability for all of the factors, interactions, and within-groups variability). A second method 
for computing omega squared computes what is referred to as partial omega squared (which 
was also computed in reference to the single-factor within-subjects analysis of variance). In 
computing the latter measure, which Keppel (1991) and Kirk (1995) view as more meaningful 
than standard omega squared, the proportion of variability for a given factor is divided by the 
sum of the proportion of variability for that factor and within-groups variability (i.e., variability 
attributable to other factors and interactions is ignored). 

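Equations 27.51-27.53, respectively, summarize the elements that are employed to compute 
standard omega squared ($\hat{\omega}^2$) for Factors A and B and the AB interaction (i.e., $\hat{\omega}^2_A$, $\hat{\omega}^2_B$, 
and $\hat{\omega}^2_{AB}$ represent standard omega squared for Factors A and B and the AB interaction). 
Equations 27.48-27.50 represent the population parameters ($\omega^2_A$, $\omega^2_B$, and $\omega^2_{AB}$) estimated 
by Equations 27.51-27.53. 

$$\omega^2_A = \frac{\sigma^2_A}{\sigma^2_A + \sigma^2_B + \sigma^2_{AB} + \sigma^2_{WG}}$$ (Equation 27.48) 

$$\omega^2_B = \frac{\sigma^2_B}{\sigma^2_A + \sigma^2_B + \sigma^2_{AB} + \sigma^2_{WG}}$$ (Equation 27.49) 

$$\omega^2_{AB} = \frac{\sigma^2_{AB}}{\sigma^2_A + \sigma^2_B + \sigma^2_{AB} + \sigma^2_{WG}}$$ (Equation 27.50) 

$$\hat{\omega}^2_A = \frac{\hat{\sigma}^2_A}{\hat{\sigma}^2_A + \hat{\sigma}^2_B + \hat{\sigma}^2_{AB} + \hat{\sigma}^2_{WG}}$$ (Equation 27.51) 

$$\hat{\omega}^2_B = \frac{\hat{\sigma}^2_B}{\hat{\sigma}^2_A + \hat{\sigma}^2_B + \hat{\sigma}^2_{AB} + \hat{\sigma}^2_{WG}}$$ (Equation 27.52) 

$$\hat{\omega}^2_{AB} = \frac{\hat{\sigma}^2_{AB}}{\hat{\sigma}^2_A + \hat{\sigma}^2_B + \hat{\sigma}^2_{AB} + \hat{\sigma}^2_{WG}}$$ (Equation 27.53) 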


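Equations 27.57-27.59, respectively, summarize the elements that are employed to compute 
partial omega squared ($\hat{\omega}^2_p$) for Factors A and B and the AB interaction. Equations 
27.54-27.56 represent the population parameters estimated by Equations 27.57-27.59. 

$$\omega^2_{p_A} = \frac{\sigma^2_A}{\sigma^2_A + \sigma^2_{WG}}$$ (Equation 27.54) 

$$\omega^2_{p_B} = \frac{\sigma^2_B}{\sigma^2_B + \sigma^2_{WG}}$$ (Equation 27.55) 

$$\omega^2_{p_{AB}} = \frac{\sigma^2_{AB}}{\sigma^2_{AB} + \sigma^2_{WG}}$$ (Equation 27.56) 

$$\hat{\omega}^2_{p_A} = \frac{\hat{\sigma}^2_A}{\hat{\sigma}^2_A + \hat{\sigma}^2_{WG}}$$ (Equation 27.57) 

$$\hat{\omega}^2_{p_B} = \frac{\hat{\sigma}^2_B}{\hat{\sigma}^2_B + \hat{\sigma}^2_{WG}}$$ (Equation 27.58) 

$$\hat{\omega}^2_{p_{AB}} = \frac{\hat{\sigma}^2_{AB}}{\hat{\sigma}^2_{AB} + \hat{\sigma}^2_{WG}}$$ (Equation 27.59) 

Where: 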

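$$\hat{\sigma}^2_A = \frac{df_A(MS_A - MS_{WG})}{npq} = \frac{(1)(84.5 - 1)}{(3)(2)(3)} = 4.64$$

$$\hat{\sigma}^2_B = \frac{df_B(MS_B - MS_{WG})}{npq} = \frac{(2)(26 - 1)}{(3)(2)(3)} = 2.78$$

$$\hat{\sigma}^2_{AB} = \frac{df_{AB}(MS_{AB} - MS_{WG})}{npq} = \frac{(2)(8 - 1)}{(3)(2)(3)} = .78$$

$$\hat{\sigma}^2_{WG} = MS_{WG} = 1$$

Thus: 

$$\hat{\omega}^2_A = \frac{4.64}{4.64 + 2.78 + .78 + 1} = .50$$

$$\hat{\omega}^2_B = \frac{2.78}{4.64 + 2.78 + .78 + 1} = .30$$

$$\hat{\omega}^2_{AB} = \frac{.78}{4.64 + 2.78 + .78 + 1} = .08$$

$$\hat{\omega}^2_{p_A} = \frac{4.64}{4.64 + 1} = .82$$

$$\hat{\omega}^2_{p_B} = \frac{2.78}{2.78 + 1} = .74$$

$$\hat{\omega}^2_{p_{AB}} = \frac{.78}{.78 + 1} = .44$$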


Equations 27.60-27.62 can also be employed to compute the values of partial omega 
squared. 

$$\hat{\omega}^2_{p_A} = \frac{(p - 1)(F_A - 1)}{(p - 1)(F_A - 1) + npq} = \frac{(2 - 1)(84.5 - 1)}{(2 - 1)(84.5 - 1) + (3)(2)(3)} = .82$$ (Equation 27.60) 

$$\hat{\omega}^2_{p_B} = \frac{(q - 1)(F_B - 1)}{(q - 1)(F_B - 1) + npq} = \frac{(3 - 1)(26 - 1)}{(3 - 1)(26 - 1) + (3)(2)(3)} = .74$$ (Equation 27.61) 

$$\hat{\omega}^2_{p_{AB}} = \frac{(p - 1)(q - 1)(F_{AB} - 1)}{(p - 1)(q - 1)(F_{AB} - 1) + npq} = \frac{(2 - 1)(3 - 1)(8 - 1)}{(2 - 1)(3 - 1)(8 - 1) + (3)(2)(3)} = .44$$ (Equation 27.62) 

The results of the above analysis for standard omega squared indicate that 50% of the 
variability on the dependent variable is associated with Factor A (Humidity), 30% with Factor 
B (Temperature), and 8% with the AB interaction. Thus, 50% + 30% + 8% = 88% of the 
variability on the dependent variable (problem-solving scores) is associated with variability on 
the two factors/independent variables and the interaction between them. It should be noted that 
although in some instances a small value for omega squared may indicate that the contribution 
of a factor or the interaction is trivial, this will not always be the case. Thus, in the example 
under discussion, although the value $\hat{\omega}^2_{AB} = .08$ is small relative to the omega squared values 
computed for the main effects, inspection of the data clearly indicates that in order to understand 
the influence of temperature on problem-solving scores, it is imperative that one take into 
account the level of humidity, and vice versa. 

The values computed for partial omega squared indicate that 82% of the variability on the 
dependent variable is associated with Factor A (Humidity), 74% with Factor B (Temperature), 
and 44% with the AB interaction. Note that since the value computed for partial omega 
squared for a given factor does not take into account variability on the other factors or the 
interaction, it yields a much higher value for that factor and the interaction than standard omega 
squared computed for the same factor and the interaction. You should also note that when 
partial omega squared is computed, the sum of the proportions/percentage values can exceed 
1/100%. 
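
The two versions of omega squared differ only in the denominator, which a few lines of code make explicit. The sketch below assumes the mean squares used in this section ($MS_A = 84.5$, $MS_B = 26$, $MS_{AB} = 8$, $MS_{WG} = 1$) with n = 3, p = 2, q = 3. The dictionary names are illustrative and not part of the text.

```python
n, p, q = 3, 2, 3
ms_wg = 1.0
ms = {"A": 84.5, "B": 26.0, "AB": 8.0}               # mean squares for each effect
df = {"A": p - 1, "B": q - 1, "AB": (p - 1) * (q - 1)}

# Estimated variance component for each effect: df(MS_effect - MS_WG) / npq
component = {e: df[e] * (ms[e] - ms_wg) / (n * p * q) for e in ms}

# Standard omega squared: effect component over all components plus error
total = sum(component.values()) + ms_wg
standard = {e: component[e] / total for e in ms}

# Partial omega squared: effect component over that component plus error only
partial = {e: component[e] / (component[e] + ms_wg) for e in ms}

for e in ms:
    print(e, round(standard[e], 2), round(partial[e], 2))
```

The printed pairs are .50 and .82 for Factor A, .30 and .74 for Factor B, and .08 and .44 for the interaction. Because each partial value drops the other effects from its denominator, it can never be smaller than the corresponding standard value.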

It was noted in an earlier discussion of omega squared (in Section VI of the t test for two 
independent samples) that Cohen (1977; 1988, pp. 284-287) has suggested the following 
(admittedly arbitrary) values, which are employed in psychology and a number of other 
disciplines, as guidelines for interpreting $\hat{\omega}^2$: a) A small effect size is one that is greater than .0099 
but not more than .0588; b) A medium effect size is one that is greater than .0588 but not more 
than .1379; and c) A large effect size is greater than .1379. If one employs Cohen's (1977, 
1988) guidelines for magnitude of treatment effect, all of the omega squared values computed 
in this section represent a large treatment effect, with the exception of $\hat{\omega}^2_{AB} = .08$, which 
represents a medium effect. 


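Cohen's f index (Test 27h) If, for a given factor, the value of partial omega squared is 
substituted in Equation 21.45, Cohen's f index can be computed. In Section VI of the 
single-factor between-subjects analysis of variance, it was noted that Cohen's f index is an alternate 
measure of effect size that can be employed for an analysis of variance. The computation of 
Cohen's f index with Equation 21.45 yields the following values: $f_A = 2.13$, $f_B = 1.69$, 
$f_{AB} = .89$. 

$$f_A = \sqrt{\frac{\hat{\omega}^2_{p_A}}{1 - \hat{\omega}^2_{p_A}}} = \sqrt{\frac{.82}{1 - .82}} = 2.13$$

$$f_B = \sqrt{\frac{\hat{\omega}^2_{p_B}}{1 - \hat{\omega}^2_{p_B}}} = \sqrt{\frac{.74}{1 - .74}} = 1.69$$

$$f_{AB} = \sqrt{\frac{\hat{\omega}^2_{p_{AB}}}{1 - \hat{\omega}^2_{p_{AB}}}} = \sqrt{\frac{.44}{1 - .44}} = .89$$
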

In the discussion of Cohen's f index in Section VI of the single-factor between-subjects 
analysis of variance, it was noted that Cohen (1977; 1988, pp. 284-288) employed the following 
(admittedly arbitrary) f values as criteria for identifying the magnitude of an effect size: a) A 
small effect size is one that is greater than .1 but not more than .25; b) A medium effect size is 
one that is greater than .25 but not more than .4; and c) A large effect size is greater than .4. 
Employing Cohen's criteria, all of the values computed for f represent large effect sizes. 
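
The conversion from partial omega squared to Cohen's f, together with Cohen's cutoffs, can be sketched as follows. The helper names `cohens_f` and `label` are illustrative assumptions. The inputs are the rounded partial omega squared values reported above, which is why the output matches the values in the text. Unrounded inputs give slightly different results (for example, about 2.15 rather than 2.13 for Factor A).

```python
import math

def cohens_f(omega_sq):
    """Cohen's f from a proportion-of-variance measure (hypothetical helper)."""
    return math.sqrt(omega_sq / (1 - omega_sq))

def label(f):
    """Classify f using the cutoffs quoted in the text."""
    if f > 0.4:
        return "large"
    if f > 0.25:
        return "medium"
    if f > 0.1:
        return "small"
    return "below small"

partial_omega_sq = {"A": 0.82, "B": 0.74, "AB": 0.44}   # rounded values from the text
f_values = {e: round(cohens_f(w), 2) for e, w in partial_omega_sq.items()}
print(f_values)                                  # {'A': 2.13, 'B': 1.69, 'AB': 0.89}
print({e: label(f) for e, f in f_values.items()})
```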

A thorough discussion of the general issues involved in computing a measure of magnitude 
of treatment effect for a between-subjects factorial analysis of variance can be found in Keppel 
(1991) and Kirk (1995). Further discussion of the indices of treatment effect discussed in this 
section, and the relationship between effect size and statistical power can be found in Section IX 
(the Addendum) of the Pearson product-moment correlation coefficient under the discussion 
of meta-analysis and related topics. 


5. Computation of a confidence interval for the mean of a population represented by a 
group The same procedure employed to compute a confidence interval for a treatment popu- 
lation for the single-factor between-subjects analysis of variance can be employed with the 
between-subjects factorial analysis of variance to compute a confidence interval for the mean 
of any population represented by the pq groups. Although it will not be demonstrated here, the 
computational procedure requires that the appropriate values be substituted in Equation 21.48. 
In the event a researcher wants to compute a confidence interval for the mean of one of the levels 
of any of the factors, the number of subjects in the denominator of the radical of Equation 21.48 
is based on the number of subjects who served within each level of the relevant factor (i.e., nq 
in the case of Factor A and np in the case of Factor B). 


6. Additional analysis of variance procedures for factorial designs Section IX (the 
Addendum) provides a description of the following additional factorial analysis of variance 
procedures: a) The factorial analysis of variance for a mixed design (Test 27i); and b) The 
within-subjects factorial analysis of variance (Test 27j). The discussion of each of the 
aforementioned analysis of variance procedures includes the following: a) A description of the 
design for which it is employed; b) An example involving the same variables which are 
employed in Example 27.1; c) The computational equations for computing the appropriate F 
ratios; and d) Computation of the F ratios for the relevant example. 


VII. Additional Discussion of the Between-Subjects Factorial 
Analysis of Variance 


1. Theoretical rationale underlying the between-subjects factorial analysis of variance As 
noted in Section IV, as is the case for the single-factor between-subjects analysis of variance, 
the total variability for the between-subjects factorial analysis of variance can be divided into 
between-groups variability and within-groups variability. Although it is not required in order 
to compute the F ratios, the value $MS_{BG}$ (which represents between-groups variability) can be 
used to represent the variance of the means of the pq groups. $MS_{BG}$ can be computed with the 
equation $MS_{BG} = SS_{BG}/df_{BG}$. As noted earlier, between-groups variability is comprised of 
the following elements: a) Variability attributable to Factor A (represented by the notation 
$MS_A$), which represents the variance of the means of the p levels of Factor A; b) Variability 
attributable to Factor B (represented by the notation $MS_B$), which represents the variance of 
the means of the q levels of Factor B; and c) Variability attributable to any interaction that 
is present between Factors A and B (represented by the notation $MS_{AB}$), which is a measure 
of variance that represents whatever remains of between-groups variability after the contributions 
of the main effects of Factors A and B have been subtracted from between-groups variability. 
In computing the three F ratios for the between-subjects factorial analysis of variance, 
the values $MS_A$, $MS_B$, and $MS_{AB}$ are contrasted with $MS_{WG}$, which serves as a baseline measure 
of error variability. In other words, $MS_{WG}$ represents experimental error which results from 
factors that are beyond an experimenter's control. As is the case for the single-factor 
between-subjects analysis of variance, in the between-subjects factorial analysis of variance, $MS_{WG}$ 
is the normal amount of variability that is expected between the scores of different subjects who 
serve under the same experimental condition. Thus, $MS_{WG}$ represents the average of the 
variances computed for each of the pq groups. As long as any of the elements that comprise 
between-groups variability ($MS_A$, $MS_B$, or $MS_{AB}$) are approximately the same value as 
within-groups variability ($MS_{WG}$), the experimenter can attribute variability on a between-groups 
component to experimental error. When, however, any of the components that comprise 
between-groups variability are substantially greater than $MS_{WG}$, it indicates that something over 
and above error variability is contributing to that element of variability. In such a case it is 
assumed that the inflated level of variability for the between-groups component is the result of 
a treatment effect. 
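
The statement that $MS_{WG}$ is the average of the pq group variances can be verified directly. The sketch below uses the scores that appear in Examples 27.2 and 27.3, which the text describes as identical to the data of Example 27.1. The variable names are illustrative. Multiplying the variance of the group means by n to obtain $MS_{BG}$ holds for equal group sizes and is an addition of this sketch, not a step stated in the text.

```python
from statistics import mean, variance

# Scores for the six groups, keyed by (level of Factor A, level of Factor B)
groups = {
    (1, 1): [11, 9, 10], (1, 2): [7, 8, 6], (1, 3): [5, 4, 3],
    (2, 1): [2, 4, 3],   (2, 2): [4, 5, 3], (2, 3): [0, 1, 2],
}
n = 3  # subjects per group

# MS_WG: average of the unbiased variance estimates of the pq groups
ms_wg = mean(variance(scores) for scores in groups.values())

# MS_BG: n times the variance of the pq group means (equal n assumed)
group_means = [mean(scores) for scores in groups.values()]
ms_bg = n * variance(group_means)

print(ms_wg, round(ms_bg, 2))
```

With these data each group has a variance of 1, so $MS_{WG} = 1$, while $MS_{BG} = 30.5$ is far larger than the error baseline.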


2. Definitional equations for the between-subjects factorial analysis of variance In the 
description of the computational protocol for the between-subjects factorial analysis of 
variance in Section IV, Equations 27.9-27.14 are employed to compute the values $SS_T$, $SS_{BG}$, 
$SS_A$, $SS_B$, $SS_{AB}$, and $SS_{WG}$. The latter set of computational equations was employed, since 
they allow for the most efficient computation of the sum of squares values. As noted in Section 
IV, computational equations are derived from definitional equations which reveal the underlying 
logic involved in the derivation of the sums of squares. 

As noted previously, the total sum of squares ($SS_T$) can be broken down into two elements, 
the between-groups sum of squares ($SS_{BG}$) and the within-groups sum of squares ($SS_{WG}$). The 
contribution of any single subject's score to the total variability in the data can be expressed in 
terms of a between-groups component and a within-groups component. When the 
between-groups component and the within-groups component are added, the sum reflects that subject's 
total contribution to the overall variability in the data. Furthermore, the between-groups sum of 
squares can be broken down into three elements: a sum of squares for Factor A ($SS_A$), a sum of 
squares for Factor B ($SS_B$), and an interaction sum of squares ($SS_{AB}$). The contribution of any 
single subject's score to between-groups variability in the data can be expressed in terms of an 
A, a B, and an AB component. When the A, B, and AB components for a given subject are 
added, the sum reflects that subject's total contribution to between-groups variability in the data. 
The aforementioned information is reflected in the definitional equations which will now be 
described for computing the sums of squares. 

Equation 27.63 is the definitional equation for the total sum of squares. In Equation 
27.63 the notation $X_{ijk}$ is employed to represent the score of the $i$th subject in the group that 
serves under Level j of Factor A and Level k of Factor B. When the notation $\sum_{k=1}^{q}\sum_{j=1}^{p}\sum_{i=1}^{n}$ 
precedes a term in parentheses, it indicates that the designated operation should be carried out 
for all N = npq subjects. 

$$SS_T = \sum_{k=1}^{q}\sum_{j=1}^{p}\sum_{i=1}^{n}(X_{ijk} - \bar{X}_T)^2$$ (Equation 27.63) 

In employing Equation 27.63 to compute $SS_T$, the grand mean ($\bar{X}_T$) is subtracted from 
each of the N = npq scores and each of the N difference scores is squared. The total sum of 
squares ($SS_T$) is the sum of the N squared difference scores. Equation 27.63 is computationally 
equivalent to Equation 27.9. 

Equation 27.64 is the definitional equation for the between-groups sum of squares. 

$$SS_{BG} = n\sum_{k=1}^{q}\sum_{j=1}^{p}(\bar{X}_{AB_{jk}} - \bar{X}_T)^2$$ (Equation 27.64) 

In employing Equation 27.64 to compute $SS_{BG}$, the following operations are carried out 
for each of the pq groups. The grand mean ($\bar{X}_T$) is subtracted from the group mean ($\bar{X}_{AB_{jk}}$). 
The difference score is squared, and the squared difference score is multiplied by the number of 
subjects in the group (n). After this is done for all pq groups, the values that have been obtained 
for each group as a result of multiplying the squared difference score by the number of subjects 
in a group are summed. The resulting value represents the between-groups sum of squares 
($SS_{BG}$). Equation 27.64 is computationally equivalent to Equation 27.10. 


Equation 27.65 is the definitional equation for the sum of squares for Factor A. 

$$SS_A = nq\sum_{j=1}^{p}(\bar{X}_{A_j} - \bar{X}_T)^2$$ (Equation 27.65) 

In employing Equation 27.65 to compute $SS_A$, the following operations are carried out 
for each of the p levels of Factor A. The grand mean ($\bar{X}_T$) is subtracted from the mean of 
that level of Factor A ($\bar{X}_{A_j}$). The difference score is squared, and the squared difference score 
is multiplied by the number of subjects in the level ($n_{A_j} = nq$). After this is done for all p levels 
of Factor A, the values that have been obtained for each level as a result of multiplying the 
squared difference score by the number of subjects in a level are summed. The resulting value 
represents the sum of squares for Factor A ($SS_A$). Equation 27.65 is computationally equivalent 
to Equation 27.11. 

Equation 27.66 is the definitional equation for the sum of squares for Factor B. 

$$SS_B = np\sum_{k=1}^{q}(\bar{X}_{B_k} - \bar{X}_T)^2$$ (Equation 27.66) 

In employing Equation 27.66 to compute $SS_B$, the following operations are carried out 
for each of the q levels of Factor B. The grand mean ($\bar{X}_T$) is subtracted from the mean of that 
level of Factor B ($\bar{X}_{B_k}$). The difference score is squared, and the squared difference score is 
multiplied by the number of subjects in the level ($n_{B_k} = np$). After this is done for all q levels 
of Factor B, the values that have been obtained for each level as a result of multiplying the 
squared difference score by the number of subjects in a level are summed. The resulting value 
represents the sum of squares for Factor B ($SS_B$). Equation 27.66 is computationally equivalent 
to Equation 27.12. 

Equation 27.67 is the definitional equation for the interaction sum of squares. 

$$SS_{AB} = n\sum_{k=1}^{q}\sum_{j=1}^{p}(\bar{X}_{AB_{jk}} - \bar{X}_{A_j} - \bar{X}_{B_k} + \bar{X}_T)^2$$ (Equation 27.67) 

In employing Equation 27.67 to compute $SS_{AB}$, the following operations are carried out 
for each of the pq groups. The mean of the level of Factor A the group represents ($\bar{X}_{A_j}$) and 
the mean of the level of Factor B the group represents ($\bar{X}_{B_k}$) are subtracted from the mean of 
the group ($\bar{X}_{AB_{jk}}$), and the grand mean ($\bar{X}_T$) is added to the resulting value. The result of the 
aforementioned operation is squared, and the squared difference score is multiplied by the 
number of subjects in that group ($n = n_{AB_{jk}}$). After this is done for all pq groups, the values that 
have been obtained for each group are summed, and the resulting value represents the sum of 
squares for the interaction ($SS_{AB}$). Equation 27.67 is computationally equivalent to Equation 
27.13. 
Equation 27.68 is the definitional equation for the within-groups sum of squares. 

$$SS_{WG} = \sum_{k=1}^{q}\sum_{j=1}^{p}\sum_{i=1}^{n}(X_{ijk} - \bar{X}_{AB_{jk}})^2$$ (Equation 27.68) 

In employing Equation 27.68 to compute $SS_{WG}$, the following operations are carried out 
for each of the pq groups. The group mean ($\bar{X}_{AB_{jk}}$) is subtracted from each score in the group. 
The difference scores are squared, after which the sum of the squared difference scores is 
obtained. The sum of the sums of the squared difference scores for all pq groups represents the 
within-groups sum of squares. Equation 27.68 is computationally equivalent to Equation 27.14. 
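
The definitional equations translate almost line for line into code, which makes the additive structure of the sums of squares easy to confirm. The sketch below is illustrative only: the variable names are assumptions, and the scores are those listed in Examples 27.2 and 27.3, described in the text as identical to the data of Example 27.1.

```python
# Scores keyed by (level j of Factor A, level k of Factor B)
scores = {
    (1, 1): [11, 9, 10], (1, 2): [7, 8, 6], (1, 3): [5, 4, 3],
    (2, 1): [2, 4, 3],   (2, 2): [4, 5, 3], (2, 3): [0, 1, 2],
}
p, q, n = 2, 3, 3

def mean(values):
    values = list(values)
    return sum(values) / len(values)

grand = mean(x for cell in scores.values() for x in cell)
cell_mean = {jk: mean(cell) for jk, cell in scores.items()}
a_mean = {j: mean(x for (jj, _), cell in scores.items() if jj == j for x in cell)
          for j in range(1, p + 1)}
b_mean = {k: mean(x for (_, kk), cell in scores.items() if kk == k for x in cell)
          for k in range(1, q + 1)}

ss_t = sum((x - grand) ** 2 for cell in scores.values() for x in cell)   # Eq. 27.63
ss_bg = n * sum((m - grand) ** 2 for m in cell_mean.values())            # Eq. 27.64
ss_a = n * q * sum((m - grand) ** 2 for m in a_mean.values())            # Eq. 27.65
ss_b = n * p * sum((m - grand) ** 2 for m in b_mean.values())            # Eq. 27.66
ss_ab = n * sum((cell_mean[j, k] - a_mean[j] - b_mean[k] + grand) ** 2
                for (j, k) in scores)                                    # Eq. 27.67
ss_wg = sum((x - cell_mean[jk]) ** 2
            for jk, cell in scores.items() for x in cell)                # Eq. 27.68

print([round(v, 2) for v in (ss_t, ss_bg, ss_a, ss_b, ss_ab, ss_wg)])
```

For these data the values are 164.5, 152.5, 84.5, 52, 16, and 12, so that $SS_T = SS_{BG} + SS_{WG}$ and $SS_{BG} = SS_A + SS_B + SS_{AB}$.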


3. Unequal sample sizes The equations presented in this book for the between-subjects 
factorial analysis of variance assume there is an equal number of subjects in each of the pq 
groups (i.e., the value of $n_{AB_{jk}}$ is equal for each group). When the number of subjects per group 
is not equal, most sources recommend that adjusted sum of squares and sample size values be 
employed in conducting the analysis of variance. One approach to dealing with unequal sample 
sizes, which is generally referred to as the unweighted means procedure, employs the harmonic 
mean of the sample sizes of the pq groups to represent the value of $n = n_{AB_{jk}}$. Based on 
the computed value of the harmonic mean (which will be designated $\bar{n}_h$), the sample size of each 
row and column, as well as the total sample size, are adjusted as follows: $n_{A_j} = (\bar{n}_h)(q)$, 
$n_{B_k} = (\bar{n}_h)(p)$, $N = (\bar{n}_h)(p)(q)$. In addition, the $\Sigma X_{AB_{jk}}$ score of each group is adjusted by 
multiplying the mean of the group derived from the original data by the value computed for the 
harmonic mean (i.e., $(\bar{X}_{AB_{jk}})(\bar{n}_h)$ = Adjusted value of $\Sigma X_{AB_{jk}}$). Employing the adjusted $\Sigma X_{AB_{jk}}$ 
values, the value of $\Sigma X_T$ and the values of the sums of the rows ($\Sigma X_{A_j}$) and columns ($\Sigma X_{B_k}$) are 
adjusted accordingly. The adjusted values of $\Sigma X_T$, $\Sigma X_{AB_{jk}}$, $\Sigma X_{A_j}$, $\Sigma X_{B_k}$, $n_{AB_{jk}}$, $n_{A_j}$, $n_{B_k}$, and 
$N$ are substituted in Equations 27.9-27.13 to compute the values $SS_T$, $SS_{BG}$, $SS_A$, $SS_B$, and 
$SS_{AB}$. The value of $SS_{WG}$, on the other hand, is a pooled within-groups sum of squares that is 
based on the original unadjusted data. Thus, employing the original unadjusted values of 
$\Sigma X_{AB_{jk}}$ and $n_{AB_{jk}}$, the sum of squares is computed for each of the pq groups employing the 
following equation: $\Sigma X^2_{AB_{jk}} - [(\Sigma X_{AB_{jk}})^2/n_{AB_{jk}}]$. The sum of the pq sum of squares values 
represents the pooled within-groups sum of squares. This latter value is, in fact, computed in 
Endnote 4. The values $MS_A$, $MS_B$, and $MS_{AB}$ are computed with Equations 27.15-27.17, by 
dividing the relevant sum of squares value by the appropriate degrees of freedom. The degrees 
of freedom for the aforementioned mean square values are computed with Equations 
27.19-27.21. Although the value $MS_{WG}$ is computed with Equation 27.18, the value $df_{WG}$ in the 
denominator of Equation 27.18 is a pooled within-groups degrees of freedom. The latter degrees 
of freedom value is determined by computing the value $(n_{AB_{jk}} - 1)$ for each group, and summing 
the $(n_{AB_{jk}} - 1)$ values for each of the pq groups. When the resulting degrees of freedom value 
is divided into the pooled within-groups sum of squares, it yields the value $MS_{WG}$ that is 
employed in computing the F ratios. Equations 27.25-27.27 are employed to compute the values 
$F_A$, $F_B$, and $F_{AB}$. Keppel (1991) notes that since the F values derived by the method described 
in this section may underestimate the likelihood of committing a Type I error, it is probably 
prudent to employ a lower tabled probability to represent the prespecified level of significance 
(i.e., employ the tabled critical $F_{.025}$ value to represent the tabled critical $F_{.05}$ value). Alternative 
methods for dealing with unequal sample sizes are described in Keppel (1991), Kirk (1982, 
1995), and Winer et al. (1991). 
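
The adjustment steps of the unweighted means procedure can be sketched in code. Everything named here is an assumption made for illustration: the function `unweighted_means_adjustment`, its return keys, and the unequal-n data set, which is hypothetical and was formed by dropping one score from the last group of the chapter's example. The sketch covers only the harmonic mean, the adjusted sample sizes and group sums, and the pooled within-groups values. It does not carry out the remaining substitution into Equations 27.9-27.13.

```python
def unweighted_means_adjustment(groups, p, q):
    """groups maps (j, k) to a list of scores; group sizes may differ."""
    sizes = [len(cell) for cell in groups.values()]
    n_h = len(sizes) / sum(1 / size for size in sizes)   # harmonic mean of the pq sizes
    # Adjusted group sum = original group mean times the harmonic mean
    adjusted_sums = {jk: (sum(cell) / len(cell)) * n_h for jk, cell in groups.items()}
    # Pooled within-groups SS and df are based on the original, unadjusted data
    ss_wg = sum(sum(x * x for x in cell) - sum(cell) ** 2 / len(cell)
                for cell in groups.values())
    df_wg = sum(size - 1 for size in sizes)
    return {"n_h": n_h, "n_A": n_h * q, "n_B": n_h * p, "N": n_h * p * q,
            "adjusted_sums": adjusted_sums, "ss_wg": ss_wg, "df_wg": df_wg,
            "ms_wg": ss_wg / df_wg}

# Hypothetical unequal-n data: the last group has only two scores
unequal = {
    (1, 1): [11, 9, 10], (1, 2): [7, 8, 6], (1, 3): [5, 4, 3],
    (2, 1): [2, 4, 3],   (2, 2): [4, 5, 3], (2, 3): [0, 1],
}
result = unweighted_means_adjustment(unequal, p=2, q=3)
print(round(result["n_h"], 3), result["ss_wg"], result["df_wg"])
```

For the hypothetical data the harmonic mean is about 2.769, the pooled within-groups sum of squares is 10.5, and the pooled degrees of freedom equal 11. When all groups have the same size, the harmonic mean equals n and no value changes.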
4. Final comments on the between-subjects factorial analysis of variance 

a) Fixed-effects versus random-effects versus mixed-effects models In Section VII of 
the single-factor between-subjects analysis of variance it is noted that one assumption 
underlying the analysis of variance is whether or not the levels of an independent variable are 
fixed or random. Whereas a fixed-effects model assumes that the levels of an independent 
variable are the same levels that will be employed in any attempted replication of the experiment, 
a random-effects model assumes that the levels have been randomly selected from the overall 
population of all possible levels that can be employed for the independent variable. The com- 
putational procedures for all of the analysis of variance procedures described in this book assume 
a fixed-effects model for all factors. 

In the case of factorial designs it is also possible to have a mixed-effects model, which is 
a combination of a fixed-effects and a random-effects model. Specifically, in the case of a two- 
factor design, a mixed-effects model assumes that one of the factors is based on a fixed-effects 
model while the second factor is based on a random-effects model. When there are three or more 
factors, a mixed-effects model assumes that one or more of the factors is based on a fixed-effects 
model and one or more of the factors is based on a random-effects model. Texts that specialize 
in the analysis of variance provide in-depth discussions of this general subject, as well as 
describing the modified equations that are appropriate for evaluating factorial designs that are 
based on random- and mixed-effects models. 

b) Nested factors/hierarchical designs and designs involving more than two factors In 
designing experiments that involve two or more factors, it is possible to employ what are referred 
to as nested factors. Nesting is present in an experimental design when different levels of one 
factor do not occur at all levels of another factor. To illustrate nesting, let us assume that a 
researcher wants to evaluate two teaching methods (which will comprise Factor A) in 10 different 
classes, each of which is unique with respect to the ethnic makeup of its students. The 10 
different classes will comprise Factor B. Five of the classes ($B_1$ ... $B_5$) are taught by teaching 
method $A_1$, and the other five classes ($B_6$ ... $B_{10}$) by teaching method $A_2$. In such a case Factor 
B is nested under Factor A, since each level of Factor B serves under only one level of Factor A. 
Figure 27.3 outlines the design. 


                  $A_1$                                        $A_2$ 
$B_1$   $B_2$   $B_3$   $B_4$   $B_5$          $B_6$   $B_7$   $B_8$   $B_9$   $B_{10}$ 


Figure 27.3 Example of a Nested Design 


Winer et al. (1991) note that in the above described design it is not possible to evaluate 
whether there is an interaction between the two factors. The reason for this is that in order to test 
for an interaction, it is necessary that all levels of Factor B must occur under all of the levels of 
Factor A. When the latter is true (as is the case in Example 27.1), the two factors are said to be 
crossed. The term hierarchical design is often employed to describe designs in which there are 
two or more nested factors. It is also possible to have a partially hierarchical design, in which 
at least one factor is nested and at least one factor is crossed. Since the statistical model upon 
which nested designs are based differs from the model that has been employed for the between- 
subjects factorial analysis of variance, the analysis of such designs requires the use of different 
equations. 

As noted earlier, a between-subjects factorial analysis of variance (as well as a factorial 
analysis of variance for a mixed design and a within-subjects factorial analysis of variance 
discussed in Section IX) can involve more than two factors. To further complicate matters, 
designs with three or more factors can involve some factors that are nested and others that are 
crossed, and a fixed-effects model may be assumed for some factors and a random-effects 
model for others. Honeck et al. (1983) is an excellent source for deriving the appropriate 
equations for the use of the analysis of variance with experimental designs involving nesting 
and/or the use of more than two factors. Other sources on this subject are Keppel (1991), Kirk 
(1982, 1995), and Winer et al. (1991). 


VIII. Additional Examples Illustrating the Use of the Between- 
Subjects Factorial Analysis of Variance 


Examples 27.2 and 27.3 are two additional examples that can be evaluated with the between- 
subjects factorial analysis of variance. Since the data for both examples are identical to that 
employed in Example 27.1, they yield the same result. Note that whereas in Example 27.1 both 
Factor A and Factor B are manipulated independent variables, in Example 27.2 Factor B is 
manipulated while Factor A is nonmanipulated (i.e., is a subject/attribute independent variable). 
In Example 27.3 both factors are nonmanipulated independent variables. 


Example 27.2 A study is conducted in order to evaluate the impact of gender (to be designated 
as Factor A) and anxiety level (to be designated as Factor B) on affiliation. The experimenter 
employs a 2 x 3 between-subjects (completely-randomized) factorial design. The two levels that 
comprise Factor A are $A_1$: Male; $A_2$: Female. The three levels that comprise Factor B are $B_1$: 
Low Anxiety; $B_2$: Moderate Anxiety; $B_3$: High Anxiety. Each of nine males and nine females 
is randomly assigned to one of three experimental conditions. All of the subjects are told they 
are participants in a learning experiment which will require them to learn lists of words. 
Subjects in the low anxiety condition are told that there will be no consequences for poor 
performance in the experiment. Subjects in the moderate anxiety condition are told if they 
perform below a certain level they will have to drink a distasteful beverage. Subjects in the high 
anxiety condition are told if they perform below a certain level they will be given a painful 
electric shock. All subjects are told that while waiting to be tested they can either wait by 
themselves or with other people. Each subject is asked to designate the number of people he or 
she would like in the room with him or her while waiting to be tested. This latter measure is 
employed to represent the dependent variable of affiliation. The experimenter assumes that the 
higher a subject is in affiliation, the more people the subject will want to be with while waiting. 
The affiliation scores of the three subjects in each of the six experimental groups/conditions 
(which result from the combinations of the levels that comprise the two factors) follow: Group 
$AB_{11}$: Male/Low anxiety (11, 9, 10); Group $AB_{12}$: Male/Moderate anxiety (7, 8, 6); Group $AB_{13}$: 
Male/High anxiety (5, 4, 3); Group $AB_{21}$: Female/Low anxiety (2, 4, 3); Group $AB_{22}$: 
Female/Moderate anxiety (4, 5, 3); Group $AB_{23}$: Female/High anxiety (0, 1, 2). Do the data 
indicate that either gender or anxiety level influences affiliation? 


Example 27.3 A study is conducted in order to evaluate if there is a relationship between 
ethnicity (to be designated as Factor A) and socioeconomic class (to be designated as Factor 
B), and the number of times a year a person visits a doctor. The experimenter employs a 2 x 3 
between-subjects (completely-randomized) factorial design. The two levels that comprise Factor 
A are A,: Caucasian; A,: Afro-American. The three levels that comprise Factor B are B,: 
Lower socioeconomic class; B,: M. iddle socioeconomic class; B4: Upper socioeconomic class. 
Based on their occupation and income, each of nine Caucasians and nine Afro-Americans is 
categorized with respect to whether he or she is a member of the lower, middle, or upper 
socioeconomic class. Upon doing this the experimenter determines the number of times during 
the past year each of the subjects has visited a doctor. This latter measure represents the 
dependent variable in the study. The number of visits for the three subjects in each of the six 
experimental groups/conditions (which result from the combinations of the levels that comprise 
the two factors) follow: Group AB_11: Caucasian/Lower socioeconomic class (11, 9, 10); Group 
AB_12: Caucasian/Middle socioeconomic class (7, 8, 6); Group AB_13: Caucasian/Upper 
socioeconomic class (5, 4, 3); Group AB_21: Afro-American/Lower socioeconomic class (2, 4, 3); 
Group AB_22: Afro-American/Middle socioeconomic class (4, 5, 3); Group AB_23: 
Afro-American/Upper socioeconomic class (0, 1, 2). Do the data indicate that either ethnicity or 
socioeconomic class is related to how often a person visits a doctor? 


IX. Addendum 
Discussion of additional analysis of variance procedures for factorial designs 


1. Test 27i: The factorial analysis of variance for a mixed design A mixed factorial design 
involves two or more independent variables/factors in which at least one of the independent variables 
is measured between-subjects (different subjects serve under each of the levels of that inde- 
pendent variable) and at least one of the independent variables is measured within-subjects (the 
same subjects or matched sets of subjects serve under all of the levels of that independent 
variable). Although the factorial analysis of variance for a mixed design can be used with 
designs involving more than two factors, the computational protocol to be described in this sec- 
tion will be limited to the two-factor experiment. For purposes of illustration it will be assumed 
that Factor A is measured between-subjects (i.e., different subjects serve in each of the p levels 
of Factor A), and that Factor B is measured within-subjects (i.e., all subjects are measured on 
each of the q levels of Factor B). Since one of the factors is measured within-subjects, a mixed 
factorial design requires a fraction of the subjects that are needed to evaluate the same set of 
hypotheses with a between-subjects factorial design (assuming both designs employ the same 
number of scores in each of the pq experimental conditions). To be more specific, the fraction 
of subjects required is 1 divided by the number of levels of the within-subjects factor (i.e., 1/q 
if Factor B is the within-subjects factor). The advantages as well as the disadvantages of a 
within-subjects analysis (which are discussed under the t test for two dependent samples and 
the single-factor within-subjects analysis of variance) also apply to the within-subjects factor 
that is evaluated with the factorial analysis of variance for a mixed design. Probably the most 
notable advantage associated with the within-subjects factor is that it allows for a more powerful 
test of an alternative hypothesis when contrasted with the between-subjects factor. Example 27.4 
is employed to illustrate the use of the factorial analysis of variance for a mixed design. 


Example 27.4 A study is conducted to evaluate the effect of humidity (to be designated as 
Factor A) and temperature (to be designated as Factor B) on mechanical problem-solving 
ability. The experimenter employs a 2 x 3 mixed factorial design. The two levels that comprise 
Factor A are A_1: Low humidity; A_2: High humidity. The three levels that comprise Factor B 
are B_1: Low temperature; B_2: Moderate temperature; B_3: High temperature. The study 
employs six subjects, three of whom are randomly assigned to Level 1 of Factor A and three of 
whom are randomly assigned to Level 2 of Factor A. Each subject is exposed to all three levels 
of Factor B. The order of presentation of the levels of Factor B is completely counterbalanced 
within the six subjects. The number of mechanical problems solved by the subjects in the six 
experimental conditions (which result from combinations of the levels of the two factors) follow: 
Condition AB_11: Low humidity/Low temperature (11, 9, 10); Condition AB_12: Low humidity/ 
Moderate temperature (7, 8, 6); Condition AB_13: Low humidity/High temperature (5, 4, 3); 
Condition AB_21: High humidity/Low temperature (2, 4, 3); Condition AB_22: High humidity/ 
Moderate temperature (4, 5, 3); Condition AB_23: High humidity/High temperature (0, 1, 2). Do 
the data indicate that either humidity or temperature influences mechanical problem-solving 
ability? 


The data for Example 27.4 are summarized in Table 27.7. 


Table 27.7 Data for Example 27.4 for Evaluation with the Factorial Analysis 
of Variance for a Mixed Design 

                                        A_1
                  B_1                B_2                B_3                Subject sums (ΣS_i)
Subject 1         11                 7                  5                  ΣS_1 = 23
Subject 2         9                  8                  4                  ΣS_2 = 21
Subject 3         10                 6                  3                  ΣS_3 = 19
Condition sums    ΣX_AB11 = 30       ΣX_AB12 = 21       ΣX_AB13 = 12       ΣX_A1 = 63
                  ΣX²_AB11 = 302     ΣX²_AB12 = 149     ΣX²_AB13 = 50

                                        A_2
                  B_1                B_2                B_3                Subject sums (ΣS_i)
Subject 4         2                  4                  0                  ΣS_4 = 6
Subject 5         4                  5                  1                  ΣS_5 = 10
Subject 6         3                  3                  2                  ΣS_6 = 8
Condition sums    ΣX_AB21 = 9        ΣX_AB22 = 12       ΣX_AB23 = 3        ΣX_A2 = 24
                  ΣX²_AB21 = 29      ΣX²_AB22 = 50      ΣX²_AB23 = 5

Column sums       ΣX_B1 = 39         ΣX_B2 = 33         ΣX_B3 = 15         ΣX_T = 87
                                                                           ΣX²_T = 585


Examination of Table 27.7 reveals that since the data employed for Example 27.4 are 
identical to those employed for Example 27.1, the summary values for the rows, columns, and 
pq experimental conditions are identical to those in Table 27.1. Thus, the following values in 
Table 27.7 are identical to those obtained in Table 27.1: n_Aj = 9 and the values computed for 
ΣX_Aj and ΣX²_Aj for each of the levels of Factor A; n_Bk = 6 and the values computed for 
ΣX_Bk and ΣX²_Bk for each of the levels of Factor B; n_ABjk = n = 3 and the values computed for 
ΣX_ABjk and ΣX²_ABjk for each of the pq experimental conditions that result from combinations of 
the levels of the two factors; N = npq = 18; ΣX_T = 87; ΣX²_T = 585. 

Note that in both the between-subjects factorial analysis of variance and the factorial 
analysis of variance for a mixed design, the value n_ABjk = n = 3 represents the number of 
scores in each of the pq experimental conditions. In the case of the factorial analysis of 
variance for a mixed design, the value N = npq = 18 represents the total number of scores in the 
set of data. Note, however, that the latter value does not represent the total number of subjects 
employed in the study, as it does in the case of the between-subjects factorial analysis of 
variance. The number of subjects employed for a factorial analysis of variance for a mixed 
design will always be the value of n multiplied by the number of levels of the between-subjects 
factor. Thus, in Example 27.4 the number of subjects is np = (3)(2) = 6. 
As is the case for the between-subjects factorial analysis of variance, the following three 
F ratios are computed for the factorial analysis of variance for a mixed design: F_A, F_B, 
F_AB. The equations required for computing the F ratios are summarized in Table 27.8. Table 
27.9 summarizes the computations for the factorial analysis of variance for a mixed design 
when it is employed to evaluate Example 27.4. In order to compute the F ratios for the factorial 
analysis of variance for a mixed design, it is required that the following summary values 
(which are also computed for the between-subjects factorial analysis of variance) be 
computed: [XS], [T], [A], [B], and [AB]. Since the summary values computed in Table 27.7 are 
identical to those computed in Table 27.1 (for Example 27.1), the same summary values are 
employed in Tables 27.8 and 27.9 to compute the values [XS], [T], [A], [B], and [AB] (which are, 
respectively, computed with Equations 27.4-27.8). Thus: [XS] = 585, [T] = 420.5, [A] = 505, 
[B] = 472.5, and [AB] = 573. Since the same set of data and the same equations are employed 
for the factorial analysis of variance for a mixed design and the between-subjects factorial 
analysis of variance, both analysis of variance procedures yield identical values for [XS], [T], 
[A], [B], and [AB]. Inspection of Table 27.8 also reveals that the factorial analysis of variance 
for a mixed design and the between-subjects factorial analysis of variance employ the same 
equations to compute the values SS_A, SS_B, SS_AB, SS_T, MS_A, MS_B, and MS_AB. 
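
The five summary values can be reproduced with a short script. The following is a minimal sketch in Python, not part of the original computational protocol. Equations 27.4-27.8 are not reproduced in this section, so the formulas below are assumed to take the standard form (sum of squared scores for [XS], squared sums divided by the number of scores contributing to each sum for the other elements); the function name summary_values and the dictionary layout are illustrative choices. The results agree with the values stated above.

```python
# Scores for Example 27.4, keyed by (level of Factor A, level of Factor B).
scores = {
    (1, 1): [11, 9, 10],
    (1, 2): [7, 8, 6],
    (1, 3): [5, 4, 3],
    (2, 1): [2, 4, 3],
    (2, 2): [4, 5, 3],
    (2, 3): [0, 1, 2],
}


def summary_values(scores, p, q, n):
    """Return ([XS], [T], [A], [B], [AB]) for a p x q design with n scores per cell.

    Assumed forms (the equations themselves are outside this section):
    [XS] = sum of squared scores, [T] = (grand sum)^2 / N,
    [A] = sum of squared A-level sums / nq, [B] = sum of squared B-level sums / np,
    [AB] = sum of squared cell sums / n.
    """
    N = n * p * q
    all_scores = [x for cell in scores.values() for x in cell]
    XS = sum(x * x for x in all_scores)
    T = sum(all_scores) ** 2 / N
    a_sums = [sum(sum(scores[(j, k)]) for k in range(1, q + 1)) for j in range(1, p + 1)]
    b_sums = [sum(sum(scores[(j, k)]) for j in range(1, p + 1)) for k in range(1, q + 1)]
    A = sum(s ** 2 for s in a_sums) / (n * q)
    B = sum(s ** 2 for s in b_sums) / (n * p)
    AB = sum(sum(cell) ** 2 for cell in scores.values()) / n
    return XS, T, A, B, AB


XS, T, A, B, AB = summary_values(scores, p=2, q=3, n=3)
print(XS, T, A, B, AB)
```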

In order to compute a number of additional sum of squares values for the factorial analysis 
of variance for a mixed design, it is necessary to compute the element [AS] (which is not 
computed for the between-subjects factorial analysis of variance). [AS], which is computed 
with Equation 27.69, is employed in Tables 27.8 and 27.9 to compute the following values: 
SS_{Between-subjects}, SS_{Subjects WG}, SS_{Within-subjects}, SS_{B×subjects WG}. 

\[ [AS] = \sum_{i=1}^{np} \left[ \frac{(\Sigma S_i)^2}{q} \right] \]   (Equation 27.69) 

The notation Σ[(ΣS_i)²/q] in Equation 27.69 indicates that for each of the np = 6 subjects, 
the sum of the subject's q = 3 scores (ΣS_i) is squared and divided by q. The resulting values 
obtained for the np subjects are summed, yielding the value [AS]. Employing Equation 27.69, 
the value [AS] = 510.33 is computed. 

\[ [AS] = \frac{(23)^2}{3} + \frac{(21)^2}{3} + \frac{(19)^2}{3} + \frac{(6)^2}{3} + \frac{(10)^2}{3} + \frac{(8)^2}{3} = 510.33 \] 
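
As a check on the arithmetic for [AS], the following minimal Python sketch recomputes the element from the six rows of Table 27.7. The variable names (subjects, AS) are illustrative and are not part of the original notation.

```python
# One row per subject (Subjects 1-6 of Table 27.7); columns are B_1, B_2, B_3.
subjects = [
    [11, 7, 5],
    [9, 8, 4],
    [10, 6, 3],
    [2, 4, 0],
    [4, 5, 1],
    [3, 3, 2],
]
q = 3  # number of levels of the within-subjects factor (Factor B)

# Equation 27.69: square each subject's sum, divide by q, then add the results.
subject_sums = [sum(row) for row in subjects]
AS = sum(s ** 2 / q for s in subject_sums)

print(subject_sums)
print(round(AS, 2))
```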


The reader should take note of the following relationships in Tables 27.8 and 27.9: 

\[ SS_{Between\text{-}subjects} = SS_A + SS_{Subjects\ WG} \] 
\[ SS_{Within\text{-}subjects} = SS_B + SS_{AB} + SS_{B \times subjects\ WG} \] 
\[ SS_T = SS_{Between\text{-}subjects} + SS_{Within\text{-}subjects} \] 
\[ df_{Between\text{-}subjects} = df_A + df_{Subjects\ WG} \] 
\[ df_{Within\text{-}subjects} = df_B + df_{AB} + df_{B \times subjects\ WG} \] 
\[ df_T = df_{Between\text{-}subjects} + df_{Within\text{-}subjects} \] 
Table 27.8 Summary Table of Equations for the Factorial Analysis 
of Variance for a Mixed Design 

Source of variation   SS                     df             MS                                                               F
Between-subjects      [AS]-[T]               np-1
  A                   [A]-[T]                p-1            MS_A = SS_A / df_A                                               F_A = MS_A / MS_{Subjects WG}
  Subjects WG         [AS]-[A]               p(n-1)         MS_{Subjects WG} = SS_{Subjects WG} / df_{Subjects WG}
Within-subjects       [XS]-[AS]              np(q-1)
  B                   [B]-[T]                q-1            MS_B = SS_B / df_B                                               F_B = MS_B / MS_{B×subjects WG}
  AB                  [AB]-[A]-[B]+[T]       (p-1)(q-1)     MS_AB = SS_AB / df_AB                                            F_AB = MS_AB / MS_{B×subjects WG}
  B×subjects WG       [XS]-[AB]-[AS]+[A]     p(q-1)(n-1)    MS_{B×subjects WG} = SS_{B×subjects WG} / df_{B×subjects WG}
Total                 [XS]-[T]               N-1 = npq-1

Table 27.9 Summary Table of Computations for Example 27.4 

Source of variation   SS                               df                      MS                                   F
Between-subjects      510.33-420.5 = 89.83             (3)(2)-1 = 5
  A                   505-420.5 = 84.5                 2-1 = 1                 MS_A = 84.5/1 = 84.5                 F_A = 84.5/1.33 = 63.53
  Subjects WG         510.33-505 = 5.33                2(3-1) = 4              MS_{Subjects WG} = 5.33/4 = 1.33
Within-subjects       585-510.33 = 74.67               (3)(2)(3-1) = 12
  B                   472.5-420.5 = 52                 3-1 = 2                 MS_B = 52/2 = 26                     F_B = 26/.83 = 31.33
  AB                  573-505-472.5+420.5 = 16         (2-1)(3-1) = 2          MS_AB = 16/2 = 8                     F_AB = 8/.83 = 9.64
  B×subjects WG       585-573-510.33+505 = 6.67        2(3-1)(3-1) = 8         MS_{B×subjects WG} = 6.67/8 = .83
Total                 585-420.5 = 164.5                18-1 = (3)(2)(3)-1 = 17

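
The computations in Tables 27.8 and 27.9 can be sketched as a single routine. The Python function below (mixed_anova, a hypothetical name) is an illustrative sketch that applies the SS and df equations of Table 27.8 to the data of Example 27.4, with Factor A assumed to be the between-subjects factor. It carries full precision throughout, so its F ratios (63.375, 31.2, and 9.6) differ slightly from the tabled values 63.53, 31.33, and 9.64, which result from rounding the error terms to 1.33 and .83 before dividing.

```python
def mixed_anova(groups):
    """Two-factor mixed design: Factor A between-subjects, Factor B within-subjects.

    groups: one list per level of Factor A; each list holds one row per subject,
    and each row holds that subject's q scores (one per level of Factor B).
    Returns dictionaries of SS, df, MS and F values.
    """
    p = len(groups)
    n = len(groups[0])
    q = len(groups[0][0])
    rows = [row for g in groups for row in g]
    all_scores = [x for row in rows for x in row]

    # Summary elements used in Table 27.8.
    XS = sum(x * x for x in all_scores)
    T = sum(all_scores) ** 2 / (n * p * q)
    A = sum(sum(sum(row) for row in g) ** 2 for g in groups) / (n * q)
    B = sum(sum(row[k] for row in rows) ** 2 for k in range(q)) / (n * p)
    AB = sum(sum(row[k] for row in g) ** 2 for g in groups for k in range(q)) / n
    AS = sum(sum(row) ** 2 for row in rows) / q

    ss = {
        "A": A - T,
        "Subjects WG": AS - A,
        "B": B - T,
        "AB": AB - A - B + T,
        "B x subjects WG": XS - AB - AS + A,
    }
    df = {
        "A": p - 1,
        "Subjects WG": p * (n - 1),
        "B": q - 1,
        "AB": (p - 1) * (q - 1),
        "B x subjects WG": p * (q - 1) * (n - 1),
    }
    ms = {source: ss[source] / df[source] for source in ss}
    f = {
        "A": ms["A"] / ms["Subjects WG"],
        "B": ms["B"] / ms["B x subjects WG"],
        "AB": ms["AB"] / ms["B x subjects WG"],
    }
    return ss, df, ms, f


# Example 27.4: three subjects under A_1 (low humidity), three under A_2 (high humidity).
example_27_4 = [
    [[11, 7, 5], [9, 8, 4], [10, 6, 3]],
    [[2, 4, 0], [4, 5, 1], [3, 3, 2]],
]
ss, df, ms, f = mixed_anova(example_27_4)
for source in ss:
    print(source, round(ss[source], 2), df[source], round(ms[source], 2))
print({source: round(value, 2) for source, value in f.items()})
```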
Inspection of Table 27.2 and Tables 27.8/27.9 reveals that if a between-subjects factorial 
analysis of variance and a factorial analysis of variance for a mixed design are employed with 
the same set of data, identical values are computed for the following: SS_A, SS_B, SS_AB, SS_T, 
df_A, df_B, df_AB, df_T, MS_A, MS_B, and MS_AB. 

In Table 27.9 the error term MS_{Subjects WG} = 1.33 (employed in computing the value 
F_A = 63.53) is identical to the value MS_WG which would be obtained if Factor B was not 
taken into account in Example 27.4, and the data on Factor A were evaluated with a single- 
factor between-subjects analysis of variance. The error term MS_{B×subjects WG} = .83, 
employed in computing the values F_B = 31.33 and F_AB = 9.64, is analogous to the error term 
employed for the single-factor within-subjects analysis of variance. For a thorough discussion 
of the derivation of the error terms for the factorial analysis of variance for a mixed design, 
the reader should consult books that discuss analysis of variance procedures in greater detail 
(e.g., Keppel (1991) and Winer et al. (1991)). 

The following tabled critical values derived from Table A10 are employed in evaluating 
the three F ratios computed for Example 27.4: a) Factor A: For df_num = df_A = 1 and 
df_den = df_{Subjects WG} = 4, F_.05 = 7.71 and F_.01 = 21.20; b) Factor B: For df_num = df_B = 2 
and df_den = df_{B×subjects WG} = 8, F_.05 = 4.46 and F_.01 = 8.65; and c) AB interaction: For 
df_num = df_AB = 2 and df_den = df_{B×subjects WG} = 8, F_.05 = 4.46 and F_.01 = 8.65. 

The identical null and alternative hypotheses that are evaluated in Section III of the 
between-subjects factorial analysis of variance are evaluated in the factorial analysis of 
variance for a mixed design. In order to reject the null hypothesis in reference to a computed 
F ratio, the obtained F value must be equal to or greater than the tabled critical value at 
the prespecified level of significance. Since the computed value F_A = 63.53 is greater than F_.05 
= 7.71 and F_.01 = 21.20, the alternative hypothesis for Factor A is supported at both the .05 and 
.01 levels. Since the computed value F_B = 31.33 is greater than F_.05 = 4.46 and F_.01 = 8.65, 
the alternative hypothesis for Factor B is supported at both the .05 and .01 levels. Since the 
computed value F_AB = 9.64 is greater than F_.05 = 4.46 and F_.01 = 8.65, the alternative 
hypothesis for an interaction between Factors A and B is supported at both the .05 and .01 levels. 
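
The decision rule just applied can be expressed compactly. The following Python sketch is illustrative only: the function name significance_level is hypothetical, and the critical values are simply the tabled values quoted above rather than values computed from the F distribution.

```python
def significance_level(f_obtained, f_05, f_01):
    """Return the most stringent level (.01, .05, or None) at which the null
    hypothesis is rejected, i.e., the obtained F equals or exceeds the tabled value."""
    if f_obtained >= f_01:
        return 0.01
    if f_obtained >= f_05:
        return 0.05
    return None


# Obtained F ratios from Table 27.9 and tabled critical values (F.05, F.01).
obtained = {"A": 63.53, "B": 31.33, "AB": 9.64}
critical = {"A": (7.71, 21.20), "B": (4.46, 8.65), "AB": (4.46, 8.65)}

decisions = {
    effect: significance_level(obtained[effect], *critical[effect])
    for effect in obtained
}
print(decisions)
```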

The analysis of the data for Example 27.4 allows the researcher to conclude that both 
humidity (Factor A) and temperature (Factor B) have a significant impact on problem-solving 
scores. However, as is the case when the same set of data is evaluated with a between-subjects 
factorial analysis of variance, the relationships depicted by the main effects must be qualified 
because of the presence of a significant interaction. Although the comparison procedures follow- 
ing the computation of the omnibus F ratios (as well as the other analytical procedures for 
determining power, effect size, etc.) described in Section VI of the between-subjects factorial 
analysis of variance can be extended to the factorial analysis of variance for a mixed design, 
they will not be described in this book. For a full description of such procedures, the reader 
should consult texts that discuss analysis of variance procedures in greater detail (e.g., Keppel 
(1991) and Winer et al. (1991)). 




2. Test 27j: The within-subjects factorial analysis of variance A within-subjects factorial 
design involves two or more factors, and all subjects are measured on each of the levels of all 
of the factors. The within-subjects factorial analysis of variance (also known as a repeated- 
measures factorial analysis of variance) is an extension of the single-factor within-subjects 
analysis of variance to experiments involving two or more independent variables/factors. 
Although the within-subjects factorial analysis of variance can be used with designs involving 
more than two factors, the computational protocol to be described in this section will be limited 
to the two-factor experiment. Within the framework of the within-subjects factorial design, 
each subject contributes pq scores (which result from the combinations of the levels that comprise 
the two factors). Since subjects serve under all pq experimental conditions, a within-subjects 
factorial design requires a fraction of the subjects that are needed to evaluate the same set of 
hypotheses with either the between-subjects factorial design or the mixed factorial design 
(assuming a given design employs the same number of scores in each of the pq experimental 
conditions). To be more specific, only 1/pq of the subjects are required for a within-subjects 
factorial design in contrast to a between-subjects factorial design. The requirement of fewer 
subjects, and the fact that a within-subjects analysis provides for a more powerful test of an 
alternative hypothesis than a between-subjects analysis, must be weighed against the fact that 
it is often impractical or impossible to have subjects serve in multiple experimental conditions. 
In addition, a within-subjects analysis of variance is more sensitive to violations of its assump- 
tions than a between-subjects analysis of variance. Example 27.5 is employed to illustrate the 
within-subjects factorial analysis of variance. 
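
The subject requirements of the three designs can be compared with a small helper. This Python sketch is illustrative; the function name subjects_required and the design labels are hypothetical, and for the mixed design Factor A is assumed to be the between-subjects factor, as in Example 27.4.

```python
def subjects_required(n, p, q, design):
    """Number of subjects needed to obtain n scores in each of the pq conditions."""
    if design == "between":
        return n * p * q  # a different group of n subjects in every condition
    if design == "mixed":
        return n * p      # n subjects per level of the between-subjects factor A
    if design == "within":
        return n          # the same n subjects serve in all pq conditions
    raise ValueError("design must be 'between', 'mixed', or 'within'")


for design in ("between", "mixed", "within"):
    print(design, subjects_required(n=3, p=2, q=3, design=design))
```

With n = 3, p = 2, and q = 3, the between-subjects design requires 18 subjects, the mixed design 18/q = 6, and the within-subjects design 18/pq = 3.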


Example 27.5 A study is conducted to evaluate the effect of humidity (to be designated as 
Factor A) and temperature (to be designated as Factor B) on mechanical problem-solving 
ability. The experimenter employs a 2 x 3 within-subjects factorial design. The two levels that 
comprise Factor A are A_1: Low humidity; A_2: High humidity. The three levels that comprise 
Factor B are B_1: Low temperature; B_2: Moderate temperature; B_3: High temperature. The 
study employs three subjects, all of whom serve under the two levels of Factor A and the three 
levels of Factor B. The order of presentation of the combinations of the two factors is 
incompletely counterbalanced. The number of mechanical problems solved by the subjects in the 
six experimental conditions (which result from combinations of the levels of the two factors) 
follow: Condition AB_11: Low humidity/Low temperature (11, 9, 10); Condition AB_12: Low 
humidity/Moderate temperature (7, 8, 6); Condition AB_13: Low humidity/High temperature 
(5, 4, 3); Condition AB_21: High humidity/Low temperature (2, 4, 3); Condition AB_22: High 
humidity/Moderate temperature (4, 5, 3); Condition AB_23: High humidity/High temperature 
(0, 1, 2). Do the data indicate that either humidity or temperature influences mechanical 
problem-solving ability? 


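The data for Example 27.5 are summarized in Tables 27.10-27.12. In Table 27.11, 
S_{iA_j} represents the score of Subject i under Level j of Factor A (i.e., the sum of that 
subject's scores across the levels of Factor B). In Table 27.12, S_{iB_k} represents the score 
of Subject i under Level k of Factor B (i.e., the sum of that subject's scores across the 
levels of Factor A). 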

Examination of Tables 27.10-27.12 reveals that since the data employed for Example 27.5 
are identical to those employed for Examples 27.1 and 27.4, the summary values for the rows, 
columns, and pq experimental conditions are identical to those in Tables 27.1 and 27.7. Thus, 
the following values in Tables 27.10-27.12 are identical to those obtained in the tables for 
Examples 27.1 and 27.4: n_Aj = 9 and the values computed for ΣX_Aj and ΣX²_Aj for each of the 
levels of Factor A; n_Bk = 6 and the values computed for ΣX_Bk and ΣX²_Bk for each of the levels 
of Factor B; n_ABjk = n = 3 and the values computed for ΣX_ABjk and ΣX²_ABjk for each of the 
pq experimental conditions that result from combinations of the levels of the two factors; 
N = npq = 18; ΣX_T = 87; ΣX²_T = 585. 

Note that in the within-subjects factorial analysis of variance, the between-subjects 
factorial analysis of variance, and the factorial analysis of variance for a mixed design, the 
value n_ABjk = n = 3 represents the number of scores in each of the pq experimental conditions. 
In the case of the within-subjects factorial analysis of variance, the value N = npq = 18 
represents the total number of scores in the set of data. Note, however, that the latter value does 
not represent the total number of subjects employed in the study as it does in the case of the 
between-subjects factorial analysis of variance. The number of subjects employed for a 
within-subjects factorial analysis of variance will always be the value of n = n_ABjk. Thus, in 
Example 27.5 the number of subjects is n = n_ABjk = 3. 


Table 27.10 Data for Example 27.5 for Evaluation with the Within-Subjects 
Factorial Analysis of Variance 

                               A_1                                         A_2                        Subject sums
                 B_1           B_2           B_3           B_1           B_2           B_3            (ΣS_i)
Subject 1        11            7             5             2             4             0              ΣS_1 = 29
Subject 2        9             8             4             4             5             1              ΣS_2 = 31
Subject 3        10            6             3             3             3             2              ΣS_3 = 27
Condition sums   ΣX_AB11=30    ΣX_AB12=21    ΣX_AB13=12    ΣX_AB21=9     ΣX_AB22=12    ΣX_AB23=3      ΣX_T = 87
                 ΣX²_AB11=302  ΣX²_AB12=149  ΣX²_AB13=50   ΣX²_AB21=29   ΣX²_AB22=50   ΣX²_AB23=5     ΣX²_T = 585


Table 27.11 Scores of Subjects on Levels of Factor A for Example 27.5 

                     A_1                A_2                Subject sums (ΣS_i)
Subject 1            S_{1A_1} = 23      S_{1A_2} = 6       ΣS_1 = 29
Subject 2            S_{2A_1} = 21      S_{2A_2} = 10      ΣS_2 = 31
Subject 3            S_{3A_1} = 19      S_{3A_2} = 8       ΣS_3 = 27
Sums for levels
of Factor A          ΣX_A1 = 63         ΣX_A2 = 24         ΣX_T = 87


Table 27.12 Scores of Subjects on Levels of Factor B for Example 27.5 

                     B_1                B_2                B_3                Subject sums (ΣS_i)
Subject 1            S_{1B_1} = 13      S_{1B_2} = 11      S_{1B_3} = 5       ΣS_1 = 29
Subject 2            S_{2B_1} = 13      S_{2B_2} = 13      S_{2B_3} = 5       ΣS_2 = 31
Subject 3            S_{3B_1} = 13      S_{3B_2} = 9       S_{3B_3} = 5       ΣS_3 = 27
Sums for levels
of Factor B          ΣX_B1 = 39         ΣX_B2 = 33         ΣX_B3 = 15         ΣX_T = 87


As is the case for the between-subjects factorial analysis of variance and the factorial 
analysis of variance for a mixed design, the following three F ratios are computed for the 
within-subjects factorial analysis of variance: F_A, F_B, F_AB. The equations required for 
computing the F ratios are summarized in Table 27.13. Table 27.14 summarizes the 
computations for the within-subjects factorial analysis of variance when it is employed to evaluate 
Example 27.5. In order to compute the F ratios for the within-subjects factorial analysis of 
variance, it is required that the following summary values (which are also computed for the 
between-subjects factorial analysis of variance and the factorial analysis of variance for a 
mixed design) be computed: [XS], [T], [A], [B], and [AB]. Since the summary values computed 
in Tables 27.10-27.12 are identical to those computed in Tables 27.1 and 27.7 (for Example 27.1 
and Example 27.4), the same summary values are employed in Tables 27.13 and 27.14 to compute 
the values [XS], [T], [A], [B], and [AB] (which are, respectively, computed with Equations 
27.4-27.8). Thus: [XS] = 585, [T] = 420.5, [A] = 505, [B] = 472.5, and [AB] = 573. Since 
the same set of data and the same equations are employed for the within-subjects factorial 
analysis of variance, the between-subjects factorial analysis of variance, and the factorial 
analysis of variance for a mixed design, all three analysis of variance procedures yield identical 
values for [XS], [T], [A], [B], and [AB]. Inspection of Table 27.13 also reveals that the within- 
subjects factorial analysis of variance, the between-subjects factorial analysis of variance, 
and the factorial analysis of variance for a mixed design employ the same equations to 
compute the values SS_A, SS_B, SS_AB, SS_T, MS_A, MS_B, and MS_AB. 

In order to compute a number of additional sum of squares values for the within-subjects 
factorial analysis of variance, it is necessary to compute the following three elements which are 
not computed for the between-subjects factorial analysis of variance: [S], [AS], and [BS]. 

[S], which is computed with Equation 27.70, is employed in Tables 27.13 and 27.14 to 
compute the following values: SS_{Between-subjects}, SS_{Within-subjects}, SS_{A×subjects}, 
SS_{B×subjects}, SS_{AB×subjects}. 

\[ [S] = \sum_{i=1}^{n} \left[ \frac{(\Sigma S_i)^2}{pq} \right] \]   (Equation 27.70) 

The notation Σ[(ΣS_i)²/pq] in Equation 27.70 indicates that for each of the n = 3 
subjects, the sum of that subject's pq = 6 scores (i.e., ΣS_i) is squared and divided by pq. The 
resulting values obtained for the n = 3 subjects are summed, yielding the value [S]. Employing 
Equation 27.70, the value [S] = 421.83 is computed. 

\[ [S] = \frac{(29)^2}{6} + \frac{(31)^2}{6} + \frac{(27)^2}{6} = 421.83 \] 
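
The value of [S] can be verified with a few lines of Python. This is a minimal sketch using the rows of Table 27.10; the variable names are illustrative.

```python
# One row per subject (Table 27.10); columns are AB_11, AB_12, AB_13, AB_21, AB_22, AB_23.
data = [
    [11, 7, 5, 2, 4, 0],
    [9, 8, 4, 4, 5, 1],
    [10, 6, 3, 3, 3, 2],
]
p, q = 2, 3

# Equation 27.70: square each subject's sum over all pq scores, divide by pq, and add.
subject_sums = [sum(row) for row in data]
S = sum(s ** 2 / (p * q) for s in subject_sums)

print(subject_sums)
print(round(S, 2))
```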


[AS], which is computed with Equation 27.71, is employed in Tables 27.13 and 27.14 to 
compute the following values: SS_{A×subjects}, SS_{AB×subjects}. 

\[ [AS] = \sum_{i=1}^{n} \sum_{j=1}^{p} \left[ \frac{(S_{iA_j})^2}{q} \right] \]   (Equation 27.71) 

The notation ΣΣ[(S_{iA_j})²/q] in Equation 27.71 indicates that each of the p = 2 
S_{iA_j} scores of the n = 3 subjects is squared and divided by q = 3. The resulting np = 6 values are 
summed, yielding the value [AS]. Employing Equation 27.71, the value [AS] = 510.33 is 
computed (which is the same value computed for [AS] when the same set of data are evaluated 
with the factorial analysis of variance for a mixed design). 

\[ [AS] = \frac{(23)^2}{3} + \frac{(6)^2}{3} + \frac{(21)^2}{3} + \frac{(10)^2}{3} + \frac{(19)^2}{3} + \frac{(8)^2}{3} = 510.33 \] 


[BS], which is computed with Equation 27.72, is employed in Tables 27.13 and 27.14 to 
compute the following values: SS_{B×subjects}, SS_{AB×subjects}. 

\[ [BS] = \sum_{i=1}^{n} \sum_{k=1}^{q} \left[ \frac{(S_{iB_k})^2}{p} \right] \]   (Equation 27.72) 

The notation ΣΣ[(S_{iB_k})²/p] in Equation 27.72 indicates that each of the q = 3 
S_{iB_k} scores of the n = 3 subjects is squared and divided by p = 2. The resulting nq = 9 values 
are summed, yielding the value [BS]. Employing Equation 27.72, the value [BS] = 476.5 is 
computed. 

\[ [BS] = \frac{(13)^2}{2} + \frac{(11)^2}{2} + \frac{(5)^2}{2} + \frac{(13)^2}{2} + \frac{(13)^2}{2} + \frac{(5)^2}{2} + \frac{(13)^2}{2} + \frac{(9)^2}{2} + \frac{(5)^2}{2} = 476.5 \] 

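
The elements [AS] and [BS] for the within-subjects design can be checked in the same way. The following Python sketch starts from the subject-by-level scores of Tables 27.11 and 27.12; the variable names a_scores and b_scores are illustrative.

```python
# Table 27.11: each subject's score under A_1 and A_2 (summed over the levels of Factor B).
a_scores = [[23, 6], [21, 10], [19, 8]]
# Table 27.12: each subject's score under B_1, B_2, B_3 (summed over the levels of Factor A).
b_scores = [[13, 11, 5], [13, 13, 5], [13, 9, 5]]
p, q = 2, 3

# Equation 27.71: square each of the np scores, divide by q, and add.
AS = sum(s ** 2 / q for row in a_scores for s in row)
# Equation 27.72: square each of the nq scores, divide by p, and add.
BS = sum(s ** 2 / p for row in b_scores for s in row)

print(round(AS, 2), round(BS, 2))
```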
The reader should take note of the following relationships in Tables 27.13 and 27.14: 

\[ SS_{Within\text{-}subjects} = SS_A + SS_B + SS_{AB} + SS_{A \times subjects} + SS_{B \times subjects} + SS_{AB \times subjects} \] 
\[ SS_T = SS_{Between\text{-}subjects} + SS_{Within\text{-}subjects} \] 
\[ df_{Within\text{-}subjects} = df_A + df_B + df_{AB} + df_{A \times subjects} + df_{B \times subjects} + df_{AB \times subjects} \] 
\[ df_T = df_{Between\text{-}subjects} + df_{Within\text{-}subjects} \] 


Inspection of Table 27.2, Tables 27.13/27.14, and Tables 27.8/27.9 reveals that if a 
between-subjects factorial analysis of variance, a within-subjects factorial analysis of var- 
iance, and a factorial analysis of variance for a mixed design are employed with the same set 
of data, identical values are computed for the following: SS_A, SS_B, SS_AB, SS_T, df_A, df_B, 
df_AB, df_T, MS_A, MS_B, and MS_AB. 

In Table 27.14, the error term MS_{A×subjects} = 2, employed in computing the value 
F_A = 42.25, is analogous to the error term that would be obtained if in evaluating the data for 
Example 27.5, Factor B was not taken into account, and the data on Factor A were evaluated with 
a single-factor within-subjects analysis of variance. The error term MS_{B×subjects} = .67, 
employed in computing the value F_B = 38.81, is analogous to the error term that would be 
obtained if, in evaluating the data for Example 27.5, Factor A was not taken into account, and 
the data on Factor B were evaluated with a single-factor within-subjects analysis of variance. 
The value MS_{AB×subjects} = 1, employed in computing the value F_AB = 8, is a measure of 
error variability specific to the AB interaction for the within-subjects factorial analysis of 
variance. For a thorough discussion of the derivation of the error terms for the within-subjects 
factorial analysis of variance, the reader should consult books which discuss analysis of 
variance procedures in greater detail (e.g., Keppel (1991) and Winer et al. (1991)). 

The following tabled critical values derived from Table A10 are employed in evaluating 
the three F ratios computed for Example 27.5: a) Factor A: For df_num = df_A = 1 and 
df_den = df_{A×subjects} = 2, F_.05 = 18.51 and F_.01 = 98.50; b) Factor B: For df_num = df_B = 2 
and df_den = df_{B×subjects} = 4, F_.05 = 6.94 and F_.01 = 18.00; and c) AB interaction: For 
df_num = df_AB = 2 and df_den = df_{AB×subjects} = 4, F_.05 = 6.94 and F_.01 = 18.00. 

The identical null and alternative hypotheses that are evaluated in Section III of the 
between-subjects factorial analysis of variance are evaluated in the within-subjects factorial 
analysis of variance. In order to reject the null hypothesis in reference to a computed F 
ratio, the obtained F value must be equal to or greater than the tabled critical value at the 
prespecified level of significance. Since the computed value F_A = 42.25 is greater than 
F_.05 = 18.51, the alternative hypothesis for Factor A is supported, but only at the .05 level. 
Since the computed value F_B = 38.81 is greater than F_.05 = 6.94 and F_.01 = 18.00, the 
alternative hypothesis for Factor B is supported at both the .05 and .01 levels. Since the 
computed value F_AB = 8 is greater than F_.05 = 6.94, the alternative hypothesis for an 
interaction between Factors A and B is supported, but only at the .05 level. 

Table 27.13 Summary Table of Equations for the Within-Subjects Factorial Design 

Source of variation   SS                                        df                MS                                                          F
Between-subjects      [S]-[T]                                   n-1
Within-subjects       [XS]-[S]                                  n(pq-1)
  A                   [A]-[T]                                   p-1               MS_A = SS_A / df_A                                          F_A = MS_A / MS_{A×subjects}
  B                   [B]-[T]                                   q-1               MS_B = SS_B / df_B                                          F_B = MS_B / MS_{B×subjects}
  AB                  [AB]-[A]-[B]+[T]                          (p-1)(q-1)        MS_AB = SS_AB / df_AB                                       F_AB = MS_AB / MS_{AB×subjects}
  A×subjects          [AS]-[A]-[S]+[T]                          (p-1)(n-1)        MS_{A×subjects} = SS_{A×subjects} / df_{A×subjects}
  B×subjects          [BS]-[B]-[S]+[T]                          (q-1)(n-1)        MS_{B×subjects} = SS_{B×subjects} / df_{B×subjects}
  AB×subjects         [XS]-[AB]-[AS]-[BS]+[A]+[B]+[S]-[T]       (p-1)(q-1)(n-1)   MS_{AB×subjects} = SS_{AB×subjects} / df_{AB×subjects}
Total                 [XS]-[T]                                  N-1 = npq-1

Table 27.14 Summary Table of Computations for Example 27.5 

Source of variation   SS                                                          df                      MS                              F
Between-subjects      421.83-420.5 = 1.33                                         3-1 = 2
Within-subjects       585-421.83 = 163.17                                         3[(2)(3)-1] = 15
  A                   505-420.5 = 84.5                                            2-1 = 1                 MS_A = 84.5/1 = 84.5            F_A = 84.5/2 = 42.25
  B                   472.5-420.5 = 52                                            3-1 = 2                 MS_B = 52/2 = 26                F_B = 26/.67 = 38.81
  AB                  573-505-472.5+420.5 = 16                                    (2-1)(3-1) = 2          MS_AB = 16/2 = 8                F_AB = 8/1 = 8
  A×subjects          510.33-505-421.83+420.5 = 4                                 (2-1)(3-1) = 2          MS_{A×subjects} = 4/2 = 2
  B×subjects          476.5-472.5-421.83+420.5 = 2.67                             (3-1)(3-1) = 4          MS_{B×subjects} = 2.67/4 = .67
  AB×subjects         585-573-510.33-476.5+505+472.5+421.83-420.5 = 4             (2-1)(3-1)(3-1) = 4     MS_{AB×subjects} = 4/4 = 1
Total                 585-420.5 = 164.5                                           18-1 = (3)(2)(3)-1 = 17
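
The full set of computations in Tables 27.13 and 27.14 can be sketched as one routine. The Python function below (within_anova, a hypothetical name) applies the SS and df equations of Table 27.13 to the data of Example 27.5. It assumes each subject's row lists the pq scores in the order AB_11, AB_12, ..., AB_pq. Because it carries full precision, it yields F_B = 39.0 rather than the tabled 38.81, which results from rounding the error term to .67 before dividing.

```python
def within_anova(data, p, q):
    """Two-factor within-subjects design.

    data: one row per subject holding pq scores ordered AB_11, AB_12, ..., AB_pq.
    Returns dictionaries of SS, df, MS and F values.
    """
    n = len(data)

    def cell(row, j, k):
        # Score of one subject under level j of Factor A and level k of Factor B (0-based).
        return row[j * q + k]

    all_scores = [x for row in data for x in row]
    XS = sum(x * x for x in all_scores)
    T = sum(all_scores) ** 2 / (n * p * q)
    A = sum(sum(cell(r, j, k) for r in data for k in range(q)) ** 2 for j in range(p)) / (n * q)
    B = sum(sum(cell(r, j, k) for r in data for j in range(p)) ** 2 for k in range(q)) / (n * p)
    AB = sum(sum(cell(r, j, k) for r in data) ** 2 for j in range(p) for k in range(q)) / n
    S = sum(sum(r) ** 2 for r in data) / (p * q)
    AS = sum(sum(cell(r, j, k) for k in range(q)) ** 2 for r in data for j in range(p)) / q
    BS = sum(sum(cell(r, j, k) for j in range(p)) ** 2 for r in data for k in range(q)) / p

    ss = {
        "A": A - T,
        "B": B - T,
        "AB": AB - A - B + T,
        "A x subjects": AS - A - S + T,
        "B x subjects": BS - B - S + T,
        "AB x subjects": XS - AB - AS - BS + A + B + S - T,
    }
    df = {
        "A": p - 1,
        "B": q - 1,
        "AB": (p - 1) * (q - 1),
        "A x subjects": (p - 1) * (n - 1),
        "B x subjects": (q - 1) * (n - 1),
        "AB x subjects": (p - 1) * (q - 1) * (n - 1),
    }
    ms = {source: ss[source] / df[source] for source in ss}
    # Each effect is tested against its own effect-by-subjects error term.
    f = {effect: ms[effect] / ms[effect + " x subjects"] for effect in ("A", "B", "AB")}
    return ss, df, ms, f


example_27_5 = [
    [11, 7, 5, 2, 4, 0],
    [9, 8, 4, 4, 5, 1],
    [10, 6, 3, 3, 3, 2],
]
ss, df, ms, f = within_anova(example_27_5, p=2, q=3)
for source in ss:
    print(source, round(ss[source], 2), df[source], round(ms[source], 2))
print({effect: round(value, 2) for effect, value in f.items()})
```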


The analysis of the data for Example 27.5 allows the researcher to conclude that both 
humidity (Factor A) and temperature (Factor B) have a significant impact on problem-solving 
scores. However, as is the case when the same set of data is evaluated with a between-subjects 
factorial analysis of variance, the relationships depicted by the main effects must be qualified 
because of the presence of a significant interaction. Although the comparison procedures fol- 
lowing the computation of the omnibus F ratios (as well as the other analytical procedures for 
determining power, effect size, etc.) described in Section VI of the between-subjects factorial 
analysis of variance can be extended to the within-subjects factorial analysis of variance, they 
will not be described in this book. For a full description of such procedures, the reader should 
consult texts that discuss analysis of variance procedures in greater detail (e.g., Keppel (1991) 
and Winer et al. (1991)). 
Endnotes 


1. A main effect refers to the effect of one independent variable on the dependent variable, 
while ignoring the effect any of the other independent variables have on the dependent 
variable. 


2. Although it is possible to conduct a directional analysis, such an analysis will not be 
described with respect to a factorial analysis of variance. A discussion of a directional 
analysis when an independent variable is comprised of two levels can be found under the 
t test for two independent samples. In addition, a discussion of one-tailed F values can 
be found in Section VI of the latter test under the discussion of Hartley's F_max test for 
homogeneity of variance/F test for two population variances. A discussion of the 
evaluation of a directional alternative hypothesis when there are two or more groups can 
be found in Section VII of the chi-square goodness-of-fit test (Test 8). Although the 
latter discussion is in reference to analysis of a k independent samples design involving 
categorical data, the general principles regarding the analysis of a directional alternative 
hypothesis are applicable to the analysis of variance. 


3. The notational system employed for the factorial analysis of variance procedures described 
in this chapter is based on Keppel (1991). 


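4. The value SS_WG = 12 can also be computed employing the following equation: 

\[ SS_{WG} = \sum_{k=1}^{q} \sum_{j=1}^{p} \left[ \Sigma X_{AB_{jk}}^2 - \frac{(\Sigma X_{AB_{jk}})^2}{n_{AB_{jk}}} \right] \] 

\[ = \left[ 302 - \frac{(30)^2}{3} \right] + \left[ 149 - \frac{(21)^2}{3} \right] + \left[ 50 - \frac{(12)^2}{3} \right] + \left[ 29 - \frac{(9)^2}{3} \right] + \left[ 50 - \frac{(12)^2}{3} \right] + \left[ 5 - \frac{(3)^2}{3} \right] \] 

\[ = 2 + 2 + 2 + 2 + 2 + 2 = 12 \] 

Note that in the above equation a within-groups sum of squares is computed for each 
of the pq = 6 groups, and SS_WG = 12 represents the sum of the six sum of squares values. 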
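
The group-by-group computation of SS_WG in this endnote can be sketched in Python as follows. The dictionary layout and variable names are illustrative; the scores are those of the six groups in Example 27.1.

```python
# Scores of the pq = 6 groups, keyed by (level of Factor A, level of Factor B).
groups = {
    (1, 1): [11, 9, 10],
    (1, 2): [7, 8, 6],
    (1, 3): [5, 4, 3],
    (2, 1): [2, 4, 3],
    (2, 2): [4, 5, 3],
    (2, 3): [0, 1, 2],
}

# Within-groups sum of squares for each group: sum of X^2 minus (sum of X)^2 / n.
ss_per_group = {
    key: sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)
    for key, xs in groups.items()
}
ss_wg = sum(ss_per_group.values())

print(ss_per_group)
print(ss_wg)
```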


5. This averaging protocol only applies when there is an equal number of subjects in the 
groups represented in the specific row or column for which an average is computed. 


6. If the factor represented on the abscissa is comprised of two levels (as is the case in Figure 
27.1a), when no interaction is present the lines representing the different levels of the 
second factor will be parallel to one another by virtue of being equidistant from one 
another. When the abscissa factor is comprised of more than two levels, the lines can be 
equidistant but not parallel when no interaction is present. 


7. As noted earlier, the fact that the lines are parallel to one another is not a requirement if no 
interaction is present when the abscissa factor is comprised of three or more levels. 


8. If no interaction is present, such comparisons should yield results that are consistent with 
those obtained when the means of the levels of that factor are contrasted. 


9. As noted in Section VI of the single-factor between-subjects analysis of variance, a 
linear contrast is equivalent to multiple t tests/Fisher's LSD test. 


10. Many researchers would elect to employ a comparison procedure that is less conservative 
than the Scheffé test, and thus would not require as large a value as $CD_S$ in order to reject 
the null hypothesis. 


11. The number of pairwise comparisons is [k(k - 1)]/2 = [6(6 - 1)]/2 = 15, where k = pq 
= (2)(3) = 6 represents the number of groups. 
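
The count of pairwise comparisons noted above can be checked by enumeration. The following is a minimal sketch; the group labels (A1B1, A1B2, etc.) are hypothetical names introduced only for illustration.

```python
from itertools import combinations
from math import comb

p, q = 2, 3   # number of levels of Factor A and Factor B
k = p * q     # number of groups

# Hypothetical group labels, one per combination of levels
groups = [f"A{j}B{m}" for j in range(1, p + 1) for m in range(1, q + 1)]

# Every unordered pair of groups is one pairwise comparison
pairs = list(combinations(groups, 2))

n_pairs = k * (k - 1) // 2   # the formula [k(k - 1)]/2
print(n_pairs, len(pairs), comb(k, 2))
```

All three values printed should agree, since each is a different way of counting the unordered pairs that can be formed from k groups.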
12. If Tukey's HSD test is employed to contrast pairs or sets of marginal means for Factors 
A and B, the values $q_{(p,\, df_{WG})}$ and $q_{(q,\, df_{WG})}$ are, respectively, employed from Table A13. 
The sample sizes used in Equation 27.45 for Factors A and B are, respectively, nq and np. 


13. When there are only two levels involved in analyzing the simple effects of a factor (as is 
the case with Factor A), the procedure to be described in this section will yield an F value 
for a simple effect that is equivalent to the F value that can be computed by comparing 
the two groups employing the linear contrast procedure described earlier (i.e., the 
procedure in which Equation 27.40 is employed to compute $SS_{comp}$). 

14. The equation for computing the sum of squares for each of the simple effects of Factor A 
is noted below. 

$$SS_{A \text{ at } B_k} = \sum_{j=1}^{p}\left[\frac{\left(\sum X_{AB_{jk}}\right)^2}{n}\right] - \frac{\left(\sum X_{B_k}\right)^2}{np}$$

If $\sum X_{AB_{jk}}$ represents the sum of the scores on Level k of Factor B of subjects who 
serve under a specific level of Factor A, the notation $\sum[(\sum X_{AB_{jk}})^2/n]$ in the above equation 
indicates that the sum of the scores for each level of Factor A at a given level of Factor B 
is squared, divided by n, and the p squared sums are summed. The notation $(\sum X_{B_k})^2$ 
represents the square of the sum of scores of the np subjects who serve under the specified 
level of Factor B. 


15. In the case of the simple effects of Factor A, the modified degrees of freedom value is 
$df_{WG} = p(n - 1)$. 


16. The fact that in the example under discussion the tabled critical values employed for 
evaluating $F_{max}$ are extremely large is due to the small value of n. However, under the 
discussion of homogeneity of variance under the single-factor between-subjects analysis 
of variance, it is noted that Keppel (1991) suggests employing a more conservative test 
anytime the value of $F_{max} > 3$. 

17. The fact that $MS_{WG}$ is an unbiased estimate of $\sigma^2_{WG}$ can be confirmed by the fact that in the 
discussion of the homogeneity of variance assumption in the previous section, it is noted 
that the estimated population variance of each group is $\tilde{s}^2_{AB_{jk}} = 1$. The latter value is 
equivalent to the value $MS_{WG} = 1$ computed for the factorial analysis of variance. 


18. The procedure described in this section assumes there is an equal number of subjects in 
each group. If the latter is true, it is also the case for Example 27.1 that $\mu_G = (\mu_{A_1} + \mu_{A_2})/2$ 
and $\mu_G = (\mu_{AB_{11}} + \mu_{AB_{12}} + \mu_{AB_{13}} + \mu_{AB_{21}} + \mu_{AB_{22}} + \mu_{AB_{23}})/6$. 


19. Different but equivalent forms of Equations 27.51-27.53 were employed to compute 
standard omega squared in the first edition of this book. 


20. For a clarification of the use of multiple summation signs, the reader should review Endnote 
63 under the single-factor between-subjects analysis of variance and Endnote 19 under 
the single-factor within-subjects analysis of variance. 
21. The notation $X_{ijk}$ is a simpler form of the notation $X_{AB_{ijk}}$, which is more consistent with the 
notational format used throughout the discussion of the between-subjects factorial 
analysis of variance. 


22. The notation $\sum_{k=1}^{q}\sum_{j=1}^{p}\sum_{i=1}^{n} X_{ijk}$ is an alternative way of writing $\sum X_T$. $\sum_{k=1}^{q}\sum_{j=1}^{p}\sum_{i=1}^{n} X_{ijk}$ 
indicates that the scores of each of the $n = n_{AB_{jk}}$ subjects in each of the pq groups are 
summed. 


23. Since the interaction sum of squares is comprised of whatever remains of between-groups 
variability after the contributions of the main effects for Factor A and Factor B have been 
removed, Equation 27.67 can be derived from the equation noted below which subtracts 
Equations 27.65 and 27.66 from Equation 27.64. 

$$SS_{AB} = n\sum_{k=1}^{q}\sum_{j=1}^{p}\left(\bar{X}_{AB_{jk}} - \bar{X}_T\right)^2 - nq\sum_{j=1}^{p}\left(\bar{X}_{A_j} - \bar{X}_T\right)^2 - np\sum_{k=1}^{q}\left(\bar{X}_{B_k} - \bar{X}_T\right)^2$$


24. The computation of the harmonic mean is described in Section VI of the t test for two 
independent samples. 


25. Some sources note that the subjects employed in such an experiment (or for that matter any 
experiment involving independent samples) are nested within the level of the factor to 
which they are assigned, since each subject serves under only one level of that factor. 


26. The computational procedure for the factorial analysis of variance for a mixed design 
assumes that there is an equal number of subjects in each of the levels of the between- 
subjects factor. When the latter is not true, adjusted equations should be employed which 
can be found in books that describe the factorial analysis of variance for a mixed design 
in greater detail. 


27. There are 12 possible presentation orders involving combinations of the two factors 
(p!q! = 2!3! = 12). The sequences for presentation of the levels of both factors are 
determined in the following manner: If $A_1$ is followed by $A_2$, presentation of the levels of 
Factor B can be in the six following sequences: 123, 132, 213, 231, 312, 321. If $A_2$ is 
followed by $A_1$, presentation of the levels of Factor B can be in the same six sequences 
noted previously. Thus, there are a total of 12 possible sequence combinations. Since there 
are only six subjects in Example 27.5, only six of the 12 possible sequence combinations 
can be employed. 
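
The 12 presentation orders described above can be enumerated directly. The sketch below is illustrative only; the labels "A1" and "A2" and the variable names are assumptions introduced for the example.

```python
from itertools import permutations, product

# Orders in which the two levels of Factor A can be presented (2! = 2)
a_orders = list(permutations(("A1", "A2")))

# Orders in which the three levels of Factor B can be presented (3! = 6)
b_orders = ["".join(str(level) for level in order)
            for order in permutations((1, 2, 3))]

# Each Factor A order can be paired with each Factor B order
sequences = list(product(a_orders, b_orders))

print(b_orders)
print(len(sequences))
```

The list of Factor B orders should match the six sequences listed in the endnote, and the number of combined sequences should equal p!q! = 12.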


28. If Factors A and B are both within-subjects factors and a significant effect is present for 
the main effects and the interaction, the within-subjects factorial analysis of variance 
would be the most likely of the three factorial analysis of variance procedures discussed 
to yield significant F ratios. The $F_A$, $F_B$, and $F_{AB}$ values obtained in Examples 27.1 
and 27.4 are significant at both the .05 and .01 levels when the data are, respectively, 
evaluated with a between-subjects factorial analysis of variance and a factorial analysis 
of variance for a mixed design. However, when Example 27.5 is evaluated with the 
within-subjects factorial analysis of variance, although $F_B$ is significant at both the 
.05 and .01 levels, $F_A$ and $F_{AB}$ are only significant at the .05 level. This latter result can 
be attributed to the fact that the data set employed for the three examples is hypothetical, 
and is not based on the scores of actual subjects who were evaluated within the framework 
of a within-subjects factorial design. In point of fact, in the case of the within-subjects 
factorial analysis of variance, the lower error degrees of freedom value employed for a specific 
effect (in contrast to the error degrees of freedom values employed for the between-subjects 
factorial analysis of variance and the factorial analysis of variance for a mixed design) will 
be associated with a tabled critical F value that is larger than the values employed for the 
latter two tests. Thus, unless there is an actual correlation between subjects' scores under 
different conditions (which should be the case if a variable is measured within-subjects), the 
loss of degrees of freedom will nullify the increase in power associated with the within-subjects 
factorial analysis of variance (assuming the data are derived from the appropriate design). 
The superior power of the within-subjects factorial analysis of variance derives from the 
smaller MS error terms employed in evaluating the main effects and interaction. 
Measures of Association/Correlation 


Test 28: The Pearson Product-Moment Correlation Coefficient 
Test 29: Spearman's Rank-Order Correlation Coefficient 
Test 30: Kendall's Tau 
Test 31: Kendall's Coefficient of Concordance 
Test 32: Goodman and Kruskal's Gamma 
Test 28 


The Pearson Product-Moment Correlation Coefficient 
(Parametric Measure of Association/Correlation 
Employed with Interval/Ratio Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


The Pearson product-moment correlation coefficient is one of a number of measures of 
correlation or association discussed in this book. Measures of correlation are not inferential 
statistical tests, but are, instead, descriptive statistical measures that represent the degree of rela- 
tionship between two or more variables. Upon computing a measure of correlation, it is common 
practice to employ one or more inferential statistical tests in order to evaluate one or more 
hypotheses concerning the correlation coefficient. The hypothesis stated below is the most 
commonly evaluated hypothesis for the Pearson product-moment correlation coefficient. 


Hypothesis evaluated with test In the underlying population represented by a sample, is the 
correlation between subjects’ scores on two variables some value other than zero? The latter 
hypothesis can also be stated in the following form: In the underlying population represented 
by the sample, is there a significant linear relationship between the two variables? 


Relevant background information on test Developed by Pearson (1896, 1900), the Pearson 
product-moment correlation coefficient is employed with interval/ratio data to determine the 
degree to which two variables covary (i.e., vary in relationship to one another). Any measure of 
correlation/association that assesses the degree of relationship between two variables is referred 
to as a bivariate measure of association. In evaluating the extent to which two variables covary, 
the Pearson product-moment correlation coefficient determines the degree to which a linear 
relationship exists between the variables. One variable (usually designated as the X variable) is 
referred to as the predictor variable, since if indeed a linear relationship does exist between the 
two variables, a subject's score on the predictor variable can be used to predict the subject's 
score on the second variable. The latter variable, which is referred to as the criterion variable, 
is usually designated as the Y variable.¹ The degree of accuracy with which a researcher will be 
able to predict a subject's score on the criterion variable from the subject's score on the predictor 
variable will depend upon the strength of the linear relationship between the two variables. The 
use of correlational data for predictive purposes is summarized under the general subject of 
regression analysis (or more formally, linear regression analysis, since, when prediction is 
discussed in reference to the Pearson product-moment correlation coefficient, it is based on 
the degree of linear relationship between the two variables). A full discussion of regression 
analysis can be found in Section VI. 

The statistic computed for the Pearson product-moment correlation coefficient is represented 
by the letter r. r is an estimate of $\rho$ (the Greek letter rho), which is the correlation 
between the two variables in the underlying population. r can assume any value within the range 
of -1 to +1 (i.e., $-1 \le r \le +1$). Thus, the value of r can never be less than -1 (i.e., r cannot 
equal -1.2, -50, etc.) or be greater than +1 (i.e., r cannot equal 1.2, 50, etc.). The absolute 
value of r (i.e., |r|) indicates the strength of the relationship between the two variables. As the 
absolute value of r approaches 1, the degree of linear relationship between the variables becomes 
stronger, achieving the maximum when |r| = 1 (i.e., when r equals either +1 or -1). The 
closer the absolute value of r is to 1, the more accurately a researcher will be able to predict a 
subject's score on one variable from the subject's score on the other variable. The closer the 
absolute value of r is to 0, the weaker the linear relationship between the two variables. As the 
absolute value of r approaches 0, the degree of accuracy with which a researcher can predict a 
subject’s score on one variable from the other variable decreases, until finally, when r = 0 there 
is no predictive relationship between the two variables. To state it another way, when r = 0 the 
use of the correlation coefficient to predict a subject's X score from the subject's Y score (or vice 
versa) will not be any more accurate than a prediction that is based on some random process (i.e., 
a prediction that is based purely on chance). 

The sign of r indicates the nature or direction of the linear relationship that exists between 
the two variables. A positive sign indicates a direct linear relationship, whereas a negative sign 
indicates an indirect (or inverse) linear relationship. A direct linear relationship is one in which 
a change on one variable is associated with a change on the other variable in the same direction 
(i.e., an increase on one variable is associated with an increase on the other variable, and a de- 
crease on one variable is associated with a decrease on the other variable). When there is a direct 
relationship, subjects who have a high score on one variable will have a high score on the other 
variable, and subjects who have a low score on one variable will have a low score on the other 
variable. The closer a positive value of r is to +1, the stronger the direct relationship between 
the two variables; whereas the closer a positive value of r is to 0, the weaker the direct rela- 
tionship between the variables. Thus, when r is close to +1, most subjects who have a high score 
on one variable will have a comparably high score on the second variable, and most subjects who 
have a low score on one variable will have a comparably low score on the second variable. As 
the value of r approaches 0, the consistency of the general pattern described by a positive cor- 
relation deteriorates, until finally, when r = 0 there will be no consistent pattern that allows one 
to predict at above chance a subject's score on one variable if one knows the subject's score on 
the other variable. 

An indirect/inverse relationship is one in which a change on one variable is associated with 
a change on the other variable in the opposite direction (i.e., an increase on one variable is associated 
with a decrease on the other variable, and a decrease on one variable is associated with an 
increase on the other variable). When there is an indirect linear relationship, subjects who have 
a high score on one variable will have a low score on the other variable, and vice versa. The 
closer a negative value of r is to -1, the stronger the indirect relationship between the two 
variables, whereas the closer a negative value of r is to 0, the weaker the indirect relationship 
between the variables. Thus, when r is close to -1, most subjects who have a high score on one 
variable will have a comparably low score on the second variable (i.e., as extreme a score in the 
opposite direction), and most subjects who have a low score on one variable will have a com- 
parably high score on the second variable. As the value of r approaches 0, the consistency of the 
general pattern described by a negative correlation deteriorates, until finally, when r = 0 there 
will be no consistent pattern that allows one to predict at above chance a subject's score on one 
variable if one knows the subject's score on the other variable. 

The use of the Pearson product-moment correlation coefficient assumes that a linear 
function best describes the relationship between the two variables. If, however, the relationship 
between the variables is better described by a curvilinear function, the value of r computed for 
a set of data may not indicate the actual extent of the relationship between the variables. In view 
of this, when a computed r value is equal to or close to 0, a researcher should always rule out the 
possibility that the two variables are related curvilinearly. One quick way of assessing the 
likelihood of the latter is to construct a scatterplot of the data. A scatterplot, which is described 
in Section VI, displays the data for a correlational analysis in a graphical format. 
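
The point that a value of r equal to 0 does not rule out a curvilinear relationship can be illustrated numerically. The following is a minimal sketch employing hypothetical data in which Y is an exact quadratic function of X; the function name `pearson_r` is an assumption introduced for the example, and it applies the computational formula presented in Section IV (Equation 28.1).

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r computed from the sum of products and the sums of squares."""
    n = len(x)
    sp_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(a ** 2 for a in x) - sum(x) ** 2 / n
    ss_y = sum(b ** 2 for b in y) - sum(y) ** 2 / n
    return sp_xy / sqrt(ss_x * ss_y)

# Hypothetical data: Y is perfectly determined by X, but not linearly
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r = pearson_r(x, y)
print(r)
```

Although every Y score can be predicted without error from its X score, the symmetric U-shaped pattern yields a sum of products of zero, and thus r = 0. A scatterplot of such data would reveal the relationship immediately.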

It is important to note that correlation does not imply causation. Consequently, if there is 
a strong correlation between two variables (i.e., the absolute value of r is close to 1), a researcher 
is not justified in concluding that one variable causes the other variable. Although it is possible 
that when a strong correlation exists one variable may, in fact, cause the other variable, the 
information employed in computing the Pearson product-moment correlation coefficient does 
not allow a researcher to draw such a conclusion. This is the case, since extraneous variables that 
have not been taken into account by the researcher can be responsible for the observed 
correlation between the two variables. 

The Pearson product-moment correlation coefficient is based on the following assump- 
tions: a) The sample of n subjects for which the value r is computed is randomly selected from 
the population it represents; b) The level of measurement upon which each of the variables is 
based is interval or ratio. Although this assumption is applicable to the conventional use of the 
Pearson product-moment correlation coefficient, there are special cases in which the equation 
for Pearson r can be employed with rank-order data (see Section VI of Spearman's rank-order 
correlation coefficient (Test 29)), and categorical data involving one or both variables (see the 
discussions of the phi coefficient (Test 16g) in Section VII, and the discussion of the point- 
biserial correlation coefficient (Test 28h) in Section IX (the Addendum)); c) The two variables 
have a bivariate normal distribution. The assumption of bivariate normality states that each 
of the variables and the linear combination of the two variables are normally distributed. With 
respect to the latter, if every possible pair of data points are plotted on a three-dimensional plane, 
the resulting surface (which will look like a mountain with a rounded peak) will be a three- 
dimensional normal distribution (i.e., a three-dimensional structure in which any cross-section 
is a standard normal distribution). Another characteristic of a bivariate normal distribution is that 
for any given value of the X variable, the scores on the Y variable will be normally distributed, 
and for any given value of the Y variable, the scores on the X variable will be normally 
distributed. In conjunction with the latter, the variances for the Y variable will be equal for each 
of the possible values of the X variable, and the variances for the X variable will be equal for each 
of the possible values of the Y variable; d) Related to the bivariate normality assumption is the 
assumption of homoscedasticity. Homoscedasticity exists in a set of data if the relationship 
between the X and Y variables is of equal strength across the whole range of both variables. 
Tabachnick and Fidell (1989) note that when the assumption of bivariate normality is met, the 
two variables will be homoscedastic. The concept of homoscedasticity is discussed in Section 
VII; and e) Another assumption of the Pearson product-moment correlation coefficient, 
referred to as nonautoregression, is discussed in many books on business and economics. This 
latter assumption, which is discussed within the framework of a special case of correlation 
referred to as autocorrelation, is only likely to be violated when pairs of numbers that are 
derived from a series of n numbers are correlated with one another. A discussion of 
autocorrelation can be found in Section VII. 


II. Example 


Example 28.1 A psychologist conducts a study employing a sample of five children to determine 
whether there is a statistical relationship between the number of ounces of sugar a ten-year-old 
child eats per week (which will represent the X variable) and the number of cavities in a child's 
mouth (which will represent the Y variable). The two scores (ounces of sugar consumed per week 
and number of cavities) obtained for each of the five children follow: Child 1 (20, 7); Child 2 
(0, 0); Child 3 (1, 2); Child 4 (12, 5); Child 5 (3, 3). Is there a significant correlation between 
sugar consumption and the number of cavities? 


III. Null versus Alternative Hypotheses 


Upon computing the Pearson product-moment correlation coefficient, it is common practice 
to determine whether the obtained absolute value of the correlation coefficient is large enough 
to allow a researcher to conclude that the underlying population correlation coefficient between 
the two variables is some value other than zero. Section V describes how the latter hypothesis, 
which is stated below, can be evaluated through use of tables of critical r values or through use 
of an inferential statistical test that is based on the t distribution.² 


Null hypothesis $H_0$: $\rho = 0$ 


(In the underlying population the sample represents, the correlation between the scores of 
subjects on Variable X and Variable Y equals 0.) 


Alternative hypothesis $H_1$: $\rho \ne 0$ 


(In the underlying population the sample represents, the correlation between the scores of 
subjects on Variable X and Variable Y equals some value other than 0. This is a nondirectional 
alternative hypothesis, and it is evaluated with a two-tailed test. Either a significant positive 
r value or a significant negative r value will provide support for this alternative hypothesis. In 
order to be significant, the obtained absolute value of r must be equal to or greater than the tabled 
critical two-tailed r value at the prespecified level of significance.) 


or 
$H_1$: $\rho > 0$ 


(In the underlying population the sample represents, the correlation between the scores of 
subjects on Variable X and Variable Y equals some value greater than 0. This is a directional 
alternative hypothesis, and it is evaluated with a one-tailed test. Only a significant positive r 
value will provide support for this alternative hypothesis. In order to be significant (in addition 
to the requirement of a positive r value), the obtained absolute value of r must be equal to or 
greater than the tabled critical one-tailed r value at the prespecified level of significance.) 


or 
$H_1$: $\rho < 0$ 


(In the underlying population the sample represents, the correlation between the scores of 
subjects on Variable X and Variable Y equals some value less than 0. This is a directional 
alternative hypothesis, and it is evaluated with a one-tailed test. Only a significant negative 
r value will provide support for this alternative hypothesis. In order to be significant (in addition 
to the requirement of a negative r value), the obtained absolute value of r must be equal to or 
greater than the tabled critical one-tailed r value at the prespecified level of significance.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected. 
IV. Test Computations 


Table 28.1 summarizes the data for Example 28.1. The following should be noted with respect 
to Table 28.1: a) The number of subjects is n = 5. Each subject has an X score and a Y score, 
and thus there are five X scores and five Y scores; b) $\sum X$, $\sum X^2$, $\sum Y$, and $\sum Y^2$, respectively, 
represent the sum of the five subjects' scores on the X variable, the sum of the five subjects' 
squared scores on the X variable, the sum of the five subjects' scores on the Y variable, and the 
sum of the five subjects' squared scores on the Y variable; and c) An XY score is obtained for 
each subject by multiplying a subject's X score by the subject's Y score. $\sum XY$ represents the sum 
of the five subjects' XY scores. 


Table 28.1 Summary of Data for Example 28.1 


Subject      X       X²       Y       Y²       XY 
   1        20      400       7       49      140 
   2         0        0       0        0        0 
   3         1        1       2        4        2 
   4        12      144       5       25       60 
   5         3        9       3        9        9 

        ΣX = 36   ΣX² = 554   ΣY = 17   ΣY² = 87   ΣXY = 211 
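
The column sums of Table 28.1 can be verified from the raw scores of Example 28.1. The following is a minimal sketch; the variable names are assumptions introduced for the example.

```python
# Raw scores from Example 28.1: (ounces of sugar per week, number of cavities)
scores = [(20, 7), (0, 0), (1, 2), (12, 5), (3, 3)]

x = [pair[0] for pair in scores]
y = [pair[1] for pair in scores]

n = len(scores)
sum_x = sum(x)                          # sum of the X scores
sum_x_sq = sum(v ** 2 for v in x)       # sum of the squared X scores
sum_y = sum(y)                          # sum of the Y scores
sum_y_sq = sum(v ** 2 for v in y)       # sum of the squared Y scores
sum_xy = sum(a * b for a, b in scores)  # sum of the XY cross products

print(n, sum_x, sum_x_sq, sum_y, sum_y_sq, sum_xy)  # 5 36 554 17 87 211
```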


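Although they are not required for computing the value of r, the mean score ($\bar{X}$ and $\bar{Y}$) and 
the estimated population standard deviation ($\tilde{s}_X$ and $\tilde{s}_Y$) for each of the variables are computed 
(the latter values are computed with Equation I.8). These values are employed in Section VI to 
derive regression equations, which are used to predict a subject's score on one variable from 
the subject's score on the other variable. 

$$\bar{X} = \frac{\sum X}{n} = \frac{36}{5} = 7.2 \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{17}{5} = 3.4$$

$$\tilde{s}_X = \sqrt{\frac{554 - \frac{(36)^2}{5}}{5 - 1}} = 8.58 \qquad \tilde{s}_Y = \sqrt{\frac{87 - \frac{(17)^2}{5}}{5 - 1}} = 2.70$$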
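
The means and estimated population standard deviations for Example 28.1 can be reproduced with a short sketch. It is assumed here that Equation I.8 is the conventional estimator in which the sum of squares is divided by n - 1; the helper name `estimated_sd` is hypothetical, and the standard library function `statistics.stdev` (which also divides by n - 1) is employed as a cross-check.

```python
from math import sqrt
from statistics import mean, stdev

x = [20, 0, 1, 12, 3]   # ounces of sugar per week
y = [7, 0, 2, 5, 3]     # number of cavities

def estimated_sd(values):
    """Estimated population standard deviation (sum of squares over n - 1)."""
    n = len(values)
    sum_of_squares = sum(v ** 2 for v in values) - sum(values) ** 2 / n
    return sqrt(sum_of_squares / (n - 1))

mean_x, mean_y = mean(x), mean(y)
sd_x, sd_y = estimated_sd(x), estimated_sd(y)

print(round(mean_x, 2), round(mean_y, 2), round(sd_x, 2), round(sd_y, 2))

# Cross-check against the standard library's n - 1 estimator
assert abs(sd_x - stdev(x)) < 1e-9 and abs(sd_y - stdev(y)) < 1e-9
```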





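Equation 28.1 (which is identical to Equation 17.7, except for the fact that the notations X 
and Y are used in place of $X_1$ and $X_2$) is employed to compute the value of r.³ 

$$r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{n}\right]}} \qquad \text{(Equation 28.1)}$$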
Substituting the appropriate values in Equation 28.1, the value r = .955 is computed. 

$$r = \frac{211 - \frac{(36)(17)}{5}}{\sqrt{\left[554 - \frac{(36)^2}{5}\right]\left[87 - \frac{(17)^2}{5}\right]}} = .955$$

The numerator of Equation 28.1, which is referred to as the sum of products (which is 
summarized with the notation $SP_{XY}$), will determine the sign of r. If the numerator is a negative 
value, r will be a negative number. If the numerator is a positive value, r will be a positive 
number. If the numerator equals zero, r will equal zero. In the case of Example 28.1, 
$SP_{XY} = \sum XY - [(\sum X)(\sum Y)/n] = 211 - [(36)(17)/5] = 88.6$. The denominator of Equation 28.1 
is the square root of the product of the sum of squares of the X scores (which is summarized 
with the notation $SS_X$), and the sum of squares of the Y scores (which is summarized with the 
notation $SS_Y$). Thus, $SS_X = \sum X^2 - [(\sum X)^2/n] = 554 - [(36)^2/5] = 294.8$ and 
$SS_Y = \sum Y^2 - [(\sum Y)^2/n] = 87 - [(17)^2/5] = 29.2$. The aforementioned sum of squares values represent the 
numerator of the equation for computing the estimated population standard deviation of the X and 
Y scores (i.e., Equation I.8). Employing the notation for the sum of products and the sums of 
squares, the equation for the Pearson product-moment correlation coefficient can be expressed 
as follows: $r = SP_{XY}/\sqrt{SS_X SS_Y}$. 
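
The sum of products, the two sums of squares, and the resulting value of r for Example 28.1 can be reproduced with the following minimal sketch. The variable names are assumptions introduced for the example.

```python
from math import sqrt

x = [20, 0, 1, 12, 3]   # ounces of sugar per week
y = [7, 0, 2, 5, 3]     # number of cavities
n = len(x)

# Sum of products: the numerator of Equation 28.1
sp_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

# Sums of squares for X and Y: the two terms under the radical
ss_x = sum(a ** 2 for a in x) - sum(x) ** 2 / n
ss_y = sum(b ** 2 for b in y) - sum(y) ** 2 / n

r = sp_xy / sqrt(ss_x * ss_y)

print(round(sp_xy, 1), round(ss_x, 1), round(ss_y, 1), round(r, 3))
```

Since the sign of r is carried entirely by the sum of products, inspecting `sp_xy` before dividing is a quick way of confirming the direction of the relationship.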

The reader should take note of the fact that each of the sum of squares values must be a 
positive number. If either of the sum of squares values is a negative number, it indicates that a 
computational error has been made. The only time a sum of squares value will equal zero, will 
be if all of the subjects have the identical score on the variable for which the sum of squares is 
computed. Anytime one or both of the sum of squares values equals zero, Equation 28.1 will 
be insoluble. It is noted in Section I that the computed value of r must fall within the range 
$-1 \le r \le +1$. Consequently, if the value of r is less than -1 or greater than +1, it indicates that 
a computational error has been made. 
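
The computational checks described above can be built into a routine for computing r. The following is a sketch under stated assumptions: the function name `pearson_r_checked` and the tolerance values are hypothetical choices, and the small tolerance is employed only to allow for floating-point rounding.

```python
from math import isclose, sqrt

def pearson_r_checked(x, y):
    """Compute r, flagging the error conditions described in the text."""
    n = len(x)
    sp_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(a ** 2 for a in x) - sum(x) ** 2 / n
    ss_y = sum(b ** 2 for b in y) - sum(y) ** 2 / n

    # A sum of squares of zero means all scores on that variable are identical,
    # in which case the equation cannot be solved (division by zero).
    if isclose(ss_x, 0.0, abs_tol=1e-12) or isclose(ss_y, 0.0, abs_tol=1e-12):
        raise ValueError("r is undefined: all scores on one variable are identical")

    r = sp_xy / sqrt(ss_x * ss_y)

    # A value outside the range -1 to +1 indicates a computational error.
    if abs(r) > 1.0 + 1e-9:
        raise ArithmeticError("r falls outside the range -1 to +1")
    return r

print(round(pearson_r_checked([20, 0, 1, 12, 3], [7, 0, 2, 5, 3]), 3))
```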








V. Interpretation of the Test Results 


The obtained value r = .955 is evaluated with Table A16 (Table of Critical Values for Pearson 
r) in the Appendix. The degrees of freedom employed for evaluating the significance of r are 
computed with Equation 28.2. 


df=n-2 (Equation 28.2) 


Employing Equation 28.2, the value df = 5 - 2 = 3 is computed. Using Table A16, it can 
be determined that the tabled critical two-tailed r values at the .05 and .01 levels of significance 
are $r_{.05} = .878$ and $r_{.01} = .959$, and the tabled critical one-tailed r values at the .05 and .01 
levels of significance are $r_{.05} = .805$ and $r_{.01} = .934$. 

The following guidelines are employed in evaluating the null hypothesis $H_0$: $\rho = 0$. 

a) If the nondirectional alternative hypothesis $H_1$: $\rho \ne 0$ is employed, the null hypothesis 
can be rejected if the obtained absolute value of r is equal to or greater than the tabled critical 
two-tailed value at the prespecified level of significance. 

b) If the directional alternative hypothesis $H_1$: $\rho > 0$ is employed, the null hypothesis can 
be rejected if the sign of r is positive, and the value of r is equal to or greater than the tabled 
critical one-tailed value at the prespecified level of significance. 

c) If the directional alternative hypothesis $H_1$: $\rho < 0$ is employed, the null hypothesis can 
be rejected if the sign of r is negative, and the absolute value of r is equal to or greater than the 
tabled critical one-tailed value at the prespecified level of significance. 

Employing the above guidelines, the nondirectional alternative hypothesis $H_1$: $\rho \ne 0$ 
is supported at the .05 level, since the computed value r = .955 is greater than the tabled critical 
two-tailed value $r_{.05} = .878$. It is not, however, supported at the .01 level, since r = .955 is less 
than the tabled critical two-tailed value $r_{.01} = .959$. 

The directional alternative hypothesis $H_1$: $\rho > 0$ is supported at both the .05 and .01 
levels, since the computed value r = .955 is a positive number that is greater than the tabled 
critical one-tailed values $r_{.05} = .805$ and $r_{.01} = .934$. 

The directional alternative hypothesis $H_1$: $\rho < 0$ is not supported, since the computed 
value r = .955 is a positive number. In order for the alternative hypothesis $H_1$: $\rho < 0$ to be 
supported, the computed value of r must be a negative number (as well as the fact that the 
absolute value of r must be equal to or greater than the tabled critical one-tailed value at the 
prespecified level of significance). 
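
The decision guidelines can be summarized in a short sketch. The critical values are those quoted in the text from Table A16 for df = 3; the dictionary layout, the function name `reject_null`, and the labels employed for the three alternative hypotheses are assumptions introduced for the example.

```python
# Tabled critical r values for df = 3, as quoted in the text from Table A16
CRITICAL_R_DF3 = {
    ("two-tailed", 0.05): 0.878,
    ("two-tailed", 0.01): 0.959,
    ("one-tailed", 0.05): 0.805,
    ("one-tailed", 0.01): 0.934,
}

def reject_null(r, alternative, alpha):
    """Apply guidelines a), b), and c) to decide whether H0 is rejected."""
    if alternative == "nondirectional":      # H1: rho != 0
        return abs(r) >= CRITICAL_R_DF3[("two-tailed", alpha)]
    if alternative == "positive":            # H1: rho > 0
        return r > 0 and r >= CRITICAL_R_DF3[("one-tailed", alpha)]
    if alternative == "negative":            # H1: rho < 0
        return r < 0 and abs(r) >= CRITICAL_R_DF3[("one-tailed", alpha)]
    raise ValueError("unknown alternative hypothesis")

r = 0.955
for alternative in ("nondirectional", "positive", "negative"):
    for alpha in (0.05, 0.01):
        print(alternative, alpha, reject_null(r, alternative, alpha))
```

For a different sample size, the dictionary would have to be replaced with the tabled values for the appropriate degrees of freedom.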

It may seem surprising that such a large correlation (i.e., an r value that almost equals 1) 
is not significant at the .01 level. Inspection of Table A16 reveals that when the sample size is 
small (as is the case in Example 28.1), the tabled critical r values are relatively large. The large 
critical values reflect the fact that the smaller the sample size, the higher the likelihood of sampling 
error resulting in a spuriously inflated correlation. At this point it is worth noting that there are a 
number of factors which can dramatically influence the value of r, and such factors are much 
more likely to distort the computed value of a correlation coefficient when the sample size is 
small. The following are among those factors that can dramatically influence the value of r: 
a) If the range of scores on either the X or Y variable is restricted, the absolute value of r will 
be reduced; b) A correlation based on a sample which is characterized by the presence of extreme 
scores on one or both of the variables (even though the scores are not extreme enough to be 
considered outliers, which are atypically extreme scores) may be spuriously high (i.e., the 
absolute value of r will be higher than the absolute value of p in the underlying population); and 
c) The presence of one or more outliers can grossly distort the absolute value of r, or even affect 
the sign of r (outliers are discussed in detail in Section VII of the t test for two independent 
samples (Test 11)). 

Further examination of Table A16 reveals that as the value of n increases, the tabled critical 
values at a given level of significance decrease, until finally when n is quite large the tabled 
critical values are quite low. What this translates into is that when the sample size is extremely 
large, an absolute r value that is barely above zero will be statistically significant. Keep in mind, 
however, that the alternative hypothesis that is evaluated only stipulates that the underlying 
population correlation is some value other than zero. The distinction between statistical versus 
practical significance (which is discussed in Section VI of the t test for two independent 
samples) is germane to this discussion, in that a small correlation may be statistically significant, 
yet not be of any practical and/or theoretical value. It should be noted, however, that in many 
instances a significant correlation which is close to zero may be of practical and/or theoretical 
significance. 


Test 28a: Test of significance for a Pearson product-moment correlation coefficient In the 
event a researcher does not have access to Table A16, Equation 28.3, which employs the t 
distribution, provides an alternative way of evaluating the null hypothesis H_0: ρ = 0. 

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$ (Equation 28.3)

Substituting the appropriate values in Equation 28.3, the value t = 5.58 is computed. 

$$t = \frac{.955\sqrt{5-2}}{\sqrt{1-(.955)^2}} = 5.58$$


The computed value t = 5.58 is evaluated with Table A2 (Table of Student's t 
Distribution) in the Appendix. The degrees of freedom employed in evaluating Equation 28.3 are 
df = n - 2. Thus, df = 5 - 2 = 3. For df = 3, the tabled critical two-tailed .05 and .01 values are 
t_.05 = 3.18 and t_.01 = 5.84, and the tabled critical one-tailed .05 and .01 values are 
t_.05 = 2.35 and t_.01 = 4.54. Since the sign of the t value computed with Equation 28.3 will 
always be the same as the sign of r, the guidelines described earlier in reference to Table A16 
for evaluating an r value can also be applied in evaluating the t value computed with Equation 
28.3 (i.e., substitute t in place of r in the text of the guidelines for evaluating r). 

Employing the guidelines, the nondirectional alternative hypothesis H_1: ρ ≠ 0 is supported 
at the .05 level, since the computed value t = 5.58 is greater than the tabled critical two-tailed 
value t_.05 = 3.18. It is not, however, supported at the .01 level, since t = 5.58 is less than the 
tabled critical two-tailed value t_.01 = 5.84. 

The directional alternative hypothesis H_1: ρ > 0 is supported at both the .05 and .01 
levels, since the computed value t = 5.58 is a positive number that is greater than the tabled 
critical one-tailed values t_.05 = 2.35 and t_.01 = 4.54. 

The directional alternative hypothesis H_1: ρ < 0 is not supported, since the computed 
value t = 5.58 is a positive number. In order for the alternative hypothesis H_1: ρ < 0 to be 
supported, the computed value of t must be a negative number (as well as the fact that the 
absolute value of t must be equal to or greater than the tabled critical one-tailed value at the 
prespecified level of significance). 

Note that the results obtained through use of Equation 28.3 are consistent with those that 
are obtained when Table A16 is employed. A summary of the analysis of Example 28.1 
follows: It can be concluded that there is a significant positive correlation between the number 
of ounces of sugar a ten-year-old child eats and the number of cavities in a child's mouth. This 
result can be summarized as follows (if it is assumed the nondirectional alternative hypothesis H_1: ρ ≠ 0 
is employed): r = .955, p < .05. 
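
The computation in Equation 28.3 can be sketched in a few lines of Python. The function name t_from_r is an illustrative choice rather than anything defined in this book, and the critical values are simply the ones quoted above from Table A2 for df = 3.

```python
import math

def t_from_r(r, n):
    """Convert a Pearson r based on n pairs of scores into a t value (Equation 28.3)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.955, 5
t = t_from_r(r, n)
df = n - 2

# Tabled critical t values for df = 3, as quoted in the text (Table A2).
two_tailed = {".05": 3.18, ".01": 5.84}
one_tailed = {".05": 2.35, ".01": 4.54}

print(f"t = {t:.2f}, df = {df}")
for alpha, crit in two_tailed.items():
    # Nondirectional hypothesis: only the absolute value of t matters.
    print(f"two-tailed {alpha}: supported = {abs(t) >= crit}")
for alpha, crit in one_tailed.items():
    # Directional hypothesis rho > 0: t must be positive and reach the critical value.
    print(f"one-tailed {alpha} (rho > 0): supported = {t > 0 and t >= crit}")
```

Note that the sketch is undefined when |r| = 1, since the denominator of Equation 28.3 is then zero.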


The coefficient of determination The square of a computed r value (i.e., r²) is referred to as 
the coefficient of determination. r² represents the proportion of variance on one variable that 
can be accounted for by variance on the other variable. The use of the term "accounted for" in 
the previous sentence should not be interpreted as indicating that a cause-effect relationship 
exists between the two variables. As noted in Section I, a substantial correlation between two 
variables does not allow one to conclude that one variable causes the other. 

For Example 28.1 the coefficient of determination is computed to be r² = (.955)² = .912, 
which expressed as a percentage is 91.2%. This indicates that 91.2% of the variation on the 
X variable can be accounted for on the basis of variability on the Y variable (or vice versa). 
Although it is possible that X causes Y (or that Y causes X), it is also possible that one or more 
extraneous variables which are related to X and/or Y, which have not been taken into account in 
the analysis, are the real reason for the strong relationship between the two variables. In order 
to demonstrate that the amount of sugar a child eats is the direct cause of the number of cavities 
he or she develops, a researcher would be required to conduct an experiment in which the amount 
of sugar consumed is a manipulated independent variable, and the number of cavities is the 
dependent variable. As noted in the Introduction of the book, an experiment in which the 
independent variable is directly manipulated by the researcher is often referred to as a true 
experiment. If a researcher conducts a true experiment to evaluate the relationship between the 
amount of sugar eaten and the number of cavities, such a study would require randomly assigning 
a representative sample of young children to two or more groups. By virtue of random assign- 
ment, it would be assumed that the resulting groups are comparable to one another. Each of the 
groups would be differentiated from one another on the basis of the amount of sugar the children 
within a group consume. Since the independent variable is manipulated, the amount of sugar 
consumed by each group is under the direct control of the experimenter. Any observed differ- 
ences on the dependent variable between the groups at some later point in time could be 
attributed to the manipulated independent variable. Thus, if, in fact, significant group differences 
with respect to the number of cavities are observed, the researcher would have a reasonable basis 
for concluding that sugar consumption is responsible for such differences. 

Whereas the correlational study represented by Example 28.1 is not able to control for 
potentially confounding variables, the true experiment described above is able to control for such 
variables. Common sense suggests, however, that practical and ethical considerations would 
make it all but impossible to conduct the sort of experiment described above. Realistically, in 
a democratic society a researcher cannot force a parent to feed her child a specified amount of 
sugar if the parent is not naturally inclined to do so. Even if a researcher discovers that through 
the use of monetary incentives she can persuade some parents to feed their children different 
amounts of sugar than they deem prudent, the latter sort of inducement would most likely 
compromise a researcher's ability to randomly assign subjects to groups, not to mention the fact 
that it would be viewed as unethical by many people. Consequently, if a researcher is inclined 
to conduct a study evaluating the relationship between sugar consumption and the number of 
cavities, it is highly unlikely that sugar consumption would be employed as a manipulated 
independent variable. In order to assess what, if any, relationship there is between the two 
variables, it is much more likely that a researcher would solicit parents whose children ate large 
versus moderate versus small amounts of sugar, and use the latter as a basis for defining her 
groups. In such a study, the amount of sugar consumed would be a nonmanipulated independent 
variable (since it represents a preexisting subject characteristic). The information derived from 
this type of study (which is commonly referred to as an ex post facto study or a natural 
experiment) is correlational in nature. This is the case, since in any study in which the 
independent variable is not manipulated by the experimenter, one is not able to effectively 
control for the influence of potentially confounding variables. Thus, if, in fact, differences are 
observed between two or more groups in an ex post facto study, although such differences may 
be due to the independent variable, they can also be due to extraneous variables. Consequently, 
in the case of the example under discussion, any observed differences in the number of cavities 
between two or more groups can be due to extraneous factors such as maternal prenatal health 
care, different home environments, dietary elements other than sugar, socioeconomic and/or 
educational differences between the families that comprise the different groups, etc. 


VI. Additional Analytical Procedures for the Pearson Product- 
Moment Correlation Coefficient and/or Related Tests 


1. Derivation of a regression line The obtained value r = .955 suggests a strong direct 
relationship between sugar consumption (X) and the number of cavities (Y). The high positive 
value of the correlation coefficient suggests that as the number of ounces of sugar consumed 
increases, there is a corresponding increase in the number of cavities. This is confirmed in 
Figure 28.1 which is a scatterplot of the data for Example 28.1. A scatterplot depicts the data 
employed in a correlational analysis in a graphical format. Each subject's two scores are 
represented by a single point on the scatterplot. The point that depicts a subject's two scores is 
arrived at by moving horizontally on the abscissa (X-axis) the number of units that corresponds 
to the subject's X score, and moving vertically on the ordinate (Y-axis) the number of units that 
corresponds to the subject's Y score. 

Employing the scatterplot, one can visually estimate the straight line that comes closest 
to passing through all of the data points. This line is referred to as the regression line (also 
known as the line of best fit). In actuality, there are two regression lines. The line commonly 
determined is the regression line of Y on X. The latter line is employed to predict a subject's 
Y score (which represents the criterion variable) by employing the subject's X score (which 
represents the predictor variable). The second regression line, the regression line of X on Y, 
allows one to predict a subject's X score by employing the subject's Y score. As will be noted 
later in this discussion, the only time the two regression lines will be identical is when the 
absolute value of r equals 1. Because X is usually designated as the predictor variable and Y as 
the criterion variable, the regression line of Y on X is the more commonly determined of the two 
regression lines. 


Figure 28.1 Scatterplot for Example 28.1 (X-axis: Ounces of sugar; Y-axis: Number of cavities) 


The regression line of Y on X (which, along with the regression line of X on Y, is 
determined mathematically later in this section) has been inserted in Figure 28.1. Note that the 
line is positively sloped — i.e., the lowest part of the line is on the lower left of the graph with 
the line slanting upward to the right. A line that is positively sloped reflects the fact that a change 
on one variable in a specific direction is accompanied by a change in the other variable in the 
same direction. A positive correlation will always result in a positively sloped regression line. 
A negative correlation, on the other hand, will always result in a negatively sloped regression line. 
In a negatively sloped regression line, the upper part of the line is at the left of the graph and the 
line slants downward as one moves to the right. A line that is negatively sloped reflects the fact 
that a change on one variable in a specific direction is accompanied by a change in the other 
variable in the opposite direction. 

Whereas the slope of the regression line indicates whether a computed r value is a positive 
or negative number, the magnitude of the absolute value of r reflects how close the n data points 
fall in relation to the regression line. When r = +1 or r = -1, all of the data points fall on the 
regression line. As the absolute value of r deviates from 1 and moves toward 0, the data points 
deviate further and further from the regression line. Figure 28.2 depicts a variety of hypothetical 
regression lines, which are presented to illustrate the relationship between the sign and absolute 
value of r and the regression line. 

In Figure 28.2 the regression lines (a), (b), (c), and (d) are positively sloped, and are thus 
associated with a positive correlation. Lines (e), (f), (g), and (h), on the other hand, are 
negatively sloped, and are associated with a negative correlation. Note that in each graph, the 
closer the data points are to the regression line, the closer the absolute value of r is to one. Thus, 
in graphs (a)-(h), the strength of the correlation (i.e., maximum, strong, moderate, weak) is a 
function of how close the data points are to the regression line. 

The use of the terms strong, moderate, and weak in relation to specific values of correlation 
coefficients is somewhat arbitrary. For the purpose of discussion the following rough guidelines 
will be employed for designating the strength of a correlation coefficient: a) If |r| > .7, a 
correlation is considered to be strong; b) If .3 ≤ |r| ≤ .7, a correlation is considered to be 
moderate; and c) If |r| < .3, a correlation is considered to be weak. In point of fact, most 
statistically significant correlations in the scientific literature are in the weak to moderate range. 
As noted earlier, although such correlations are not always of practical and/or theoretical impor- 
tance, there are many instances where they are. 
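
The rough guidelines above can be expressed as a small helper. This is only a sketch: the function name correlation_strength is hypothetical, and treating the moderate band as including both .3 and .7 is an assumption about how the boundary cases are meant to be handled.

```python
def correlation_strength(r):
    """Label a correlation coefficient using the rough guidelines in the text.

    Assumption: the moderate band includes its endpoints .3 and .7.
    """
    size = abs(r)  # strength depends on the absolute value, not the sign
    if size > 0.7:
        return "strong"
    if size >= 0.3:
        return "moderate"
    return "weak"

for value in (0.955, -0.5, 0.1):
    print(value, correlation_strength(value))
```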

Graphs (i) and (j) in Figure 28.2 depict data which result in a correlation of zero, since in 
both instances the distribution of data points is random and, consequently, a straight line cannot 
be used to describe the relationship between the two variables with any degree of accuracy. 
Whenever the Pearson product-moment correlation coefficient equals zero, the regression line will be 
parallel to either the X-axis (as in Graph (i)) or the Y-axis (as in Graph (j)), depending upon 
which regression line is drawn. 

Two other instances in which the regression line is parallel to the X-axis or the Y-axis are 
depicted in Graphs (k) and (l). Both of these graphs depict data for which a value of r cannot be 
computed. The data depicted in graphs (k) and (l) illustrate that in order to compute a coefficient 
of correlation, there must be variability on both the X and the Y variables. Specifically, in Graph 
(k) the regression line is parallel to the X-axis. The configuration of the data upon which this 
graph is based indicates that, although there is variability with respect to subjects’ scores on the 
X variable, there is no variability with respect to their scores on the Y variable — i.e., all of the 
subjects obtain the identical score on the Y variable. As a result of the latter, the computed value 
for the estimated population variance for the Y variable will equal zero. When, in fact, the value 
of the variance for the Y variable equals zero, the sum of squares of the Y scores will equal zero 
(i.e., SS_Y = ΣY² - [(ΣY)²/n] = 0). The sum of squares of the Y scores (which is the 
numerator of the equation for computing the estimated population variance of the Y scores) is, as 
noted earlier, one of the elements that comprises the denominator of Equation 28.1. Con- 
sequently, if the sum of squares of the Y scores equals zero, the latter equation becomes 
insoluble. Note that if the regression line depicted in Graph (k) is employed to predict a subject's 
Y score from the subject's X score, all subjects are predicted to have the same score. If, in fact, 
all subjects have the same Y score, there is no need to employ the regression line to make a 
prediction. 

Figure 28.2 Hypothetical Regression Lines 

Graph (l) illustrates a regression line that is parallel to the Y-axis. The configuration of the 
data upon which the latter graph is based indicates that, although there is variability with respect 
to subjects’ scores on the Y variable, there is no variability with respect to their scores on the 
X variable — i.e., all of the subjects obtain the identical score on the X variable. As a result of 
the latter, the computed value for the estimated population variance for the X variable will equal 
zero. When, in fact, the value of the variance for the X variable equals zero, the sum of squares 
of the X scores will equal zero (i.e., SS_X = ΣX² - [(ΣX)²/n] = 0). The sum of squares of the 
X scores (which is the numerator of the equation for computing the estimated population variance 
of the X scores) is, as noted earlier, one of the elements that comprises the denominator of 
Equation 28.1. Consequently, if the sum of squares of the X scores equals zero, the latter 
equation becomes insoluble. Note that if the regression line depicted in Graph (l) is employed 
to predict a subject's X score from the subject's Y score, all subjects are predicted to have the 
same X score. If, in fact, all subjects have the same X score, there is no need to employ the 
regression line to make a prediction. 

If both the X and Y variables have no variability (i.e., all subjects obtain the identical score 
on the X variable and all subjects obtain the identical score on the Y variable), the resulting 
graph will consist of a single point (which is the case for Graph (m)). Thus, the single point in 
Graph (m) indicates that each of the n subjects in a sample obtains identical scores on both the 
X and Y variables. 
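
The situations in Graphs (k), (l), and (m) can be checked numerically. The sketch below computes r from the sum of products and the two sums of squares, and returns None when either sum of squares is zero, since the denominator is then zero and r cannot be computed. The function name pearson_r and all of the scores are hypothetical, chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from the sum of products and sums of squares.

    Returns None when SS_X or SS_Y is zero, because r is then undefined.
    """
    n = len(xs)
    sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    if ss_x == 0 or ss_y == 0:
        return None
    return sp_xy / math.sqrt(ss_x * ss_y)

# Hypothetical scores, not taken from the book.
flat_y = pearson_r([1, 2, 3, 4, 5], [3, 3, 3, 3, 3])      # no variability on Y, as in Graph (k)
flat_x = pearson_r([4, 4, 4, 4, 4], [1, 2, 3, 4, 5])      # no variability on X, as in Graph (l)
single_point = pearson_r([2, 2, 2], [6, 6, 6])            # no variability on either, as in Graph (m)
varying = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])     # variability on both variables

print(flat_y, flat_x, single_point, varying)
```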

At this point in the discussion, the role of the slope of a regression line will be clarified. 
The slope of a line indicates the number of units the Y variable will change if the X variable is 
incremented by one unit. This definition for the slope is applicable to the regression line of Y 
on X. The slope of the regression line of X on Y, on the other hand, indicates the number of 
units the X variable will change if the Y variable is incremented by one unit. The discussion to 
follow will employ the definition of the slope in reference to the regression line of Y on X. 

A line with a large positive slope or large negative slope is inclined in an upward direction 
away from the X-axis — i.e., like a hill with a high grade. The more the magnitude of the 
positive slope increases, the more the line approaches being parallel to the Y-axis. A line with 
a small positive slope or small negative slope has a minimal inclination in relation to the X-axis 
— i.e., like a hill with a low grade. The smaller the slope of a line, the more the line approaches 
being parallel to the X-axis. The graphs in Figure 28.3 reflect the following degrees of slope: 
Graphs (a) and (b), respectively, depict lines with a large positive slope and a large negative 
slope; Graphs (c) and (d), respectively, depict lines with a moderate positive slope and a 
moderate negative slope (i.e., the severity of the angle in relation to the X-axis is in between that 
of a line with a large slope and a small slope); Graphs (e) and (f), respectively, depict lines with 
a small positive slope and a small negative slope. 

It is important to keep in mind that although the slope of the regression line of Y on X plays 
a role in determining the specific value of Y that is predicted from the value of X, the magnitude 
of the slope is not related to the magnitude of the absolute value of the coefficient of correlation. 
A regression line with a large slope can be associated with a correlation coefficient that has a 
large, moderate, or small absolute value. In the same respect, a regression line with a small slope 
can be associated with a correlation coefficient that has a large, moderate, or small absolute value. 
Thus, the accuracy of a prediction is not a function of the slope of the regression line. Instead, it 
is a function of how far removed the data points are from the regression line. To illustrate this 
point, let us assume that a regression line which has a large positive slope (such as Graph (a) in 
Figure 28.3) is being used to predict Y scores for a set of X scores that are one unit apart from one 
another. As the magnitude of an X score increases by one unit, there is a sizeable increase in the 
Y score predicted for each subsequent value of X. In the opposite respect, if the regression line 
has a small positive slope (such as Graph (e) in Figure 28.3), as the magnitude of an X score 
increases by one unit, there is a minimal increase in the Y score predicted for each subsequent 
value of X. It is important to note, however, that in both of the aforementioned examples, 
regardless of whether the slope of the regression line is large or small, the accuracy of the 
predicted Y scores will not be affected by the magnitude of the slope of the line. Consequently, 
for any of the regression lines depicted in Figure 28.3, the n data points can fall on, close to, or 
be far removed from the regression line. 

Figure 28.3 Hypothetical Regression Lines 
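
A quick numerical check of the point that slope and correlation are unrelated: multiplying every Y score by a constant changes the slope of the regression line of Y on X by that same constant, yet leaves r unchanged. The scores and the helper name slope_and_r below are hypothetical and serve only as an illustration.

```python
import math

def slope_and_r(xs, ys):
    """Return (slope of the regression line of Y on X, Pearson r)."""
    n = len(xs)
    sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return sp_xy / ss_x, sp_xy / math.sqrt(ss_x * ss_y)

# Hypothetical scores, not taken from the book.
xs = [1, 2, 3, 4, 5]
shallow_ys = [2, 4, 5, 4, 5]
steep_ys = [10 * y for y in shallow_ys]  # same pattern, ten times the scale on Y

slope_shallow, r_shallow = slope_and_r(xs, shallow_ys)
slope_steep, r_steep = slope_and_r(xs, steep_ys)

print(f"small slope: b = {slope_shallow:.2f}, r = {r_shallow:.3f}")
print(f"large slope: b = {slope_steep:.2f}, r = {r_steep:.3f}")
```

The two data sets yield very different slopes but the same value of r, so the points are equally close to their respective regression lines.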


Mathematical derivation of the regression line The most accurate way to determine the 
regression line is to compute, through use of a procedure referred to as the method of least 
squares, the equation of the straight line that comes closest to passing through all of the data 
points. As noted earlier, in actuality there are two regression lines — the regression line of Y 
on X (which is employed to predict a subject's Y score by employing the subject's X score), and 
the regression line of X on Y (which is employed to predict a subject's X score by employing 
the subject’s Y score). The equations for the two regression lines will always be different, except 
when the absolute value of r equals 1. When |r| = 1, the two regression lines are identical (both 
visually and algebraically). The reason why the two regression lines are always different (except 
when |r| = 1) is because the regression line of Y on X is based on the equation that results in the 
minimum squared distance of all the data points from the line, when the distance of the points 
from the line is measured vertically (i.e., ↑ or ↓). On the other hand, the regression line of X on 
Y is based on the minimum squared distance of the data points from the line, when the distance 
of the points from the line is measured horizontally (i.e., → or ←). Since when |r| = 1 all the 
points fall on the regression line, both the vertical and horizontal squared distance for each data 
point equals zero. Consequently, when |r| = 1 the two regression lines are identical. 

The regression line of Y on X is determined with Equation 28.4. 

$$Y' = a_Y + b_Y X$$ (Equation 28.4)


Where: Y' represents the predicted Y score for a subject 
X represents the subject's X score that is used to predict the value Y' 
a_Y represents the Y intercept, which is the point at which the regression line crosses 
the Y-axis 
b_Y represents the slope of the regression line of Y on X 


In order to derive Equation 28.4, the values of b_Y and a_Y must be computed. Either 
Equation 28.5 or Equation 28.6 can be employed to compute the value b_Y. The latter equations 
are employed below to compute the value b_Y = .30. 

$$b_Y = \frac{SP_{XY}}{SS_X} = \frac{\Sigma XY - \frac{(\Sigma X)(\Sigma Y)}{n}}{\Sigma X^2 - \frac{(\Sigma X)^2}{n}} = .30$$ (Equation 28.5)

$$b_Y = r\left(\frac{s_Y}{s_X}\right) = (.955)\left(\frac{2.70}{8.58}\right) = .30$$ (Equation 28.6)

Equation 28.7 is employed to compute the value a_Y. The latter equation is employed below 
to compute the value a_Y = 1.24. 

$$a_Y = \bar{Y} - b_Y\bar{X} = 3.4 - (.30)(7.2) = 1.24$$ (Equation 28.7)


Substituting the values a_Y = 1.24 and b_Y = .30 in Equation 28.4, we determine that the 
equation for the regression line of Y on X is Y' = 1.24 + .30X. Since two points can be used to 
construct a straight line, we can select two values for X and substitute each value in Equation 
28.4, and solve for the values that would be predicted for Y'. Each set of values that is comprised 
of an X score and the resulting Y' value will represent one point on the regression line. Thus, if 
we plot any two points derived in this manner and connect them, the resulting line is the 
regression line of Y on X. To demonstrate this, if the value X = 0 is substituted in the regression 
equation, it yields the value Y' = 1.24 (which equals the value of a_Y): Y' = 1.24 + (.30)(0) = 
1.24. Thus, the first point that will be employed in constructing the regression line is (0, 1.24). 
If we next substitute the value X = 5 in the regression equation, it yields the value Y' = 2.74: Y' 
= 1.24 + (.30)(5) = 2.74. Thus, the second point to be used in constructing the regression line 
is (5, 2.74). The regression line that results from connecting the points (0, 1.24) and (5, 2.74) 
is displayed in Figure 28.4. 

If the researcher wants to predict a subject's score on the Y variable by employing the 
subject's score on the X variable, the predicted value Y' can be derived either from the regression 
equation or from Figure 28.4. If the regression equation is employed, the value Y' is derived by 
substituting a subject's X score in the equation (which is the same procedure that is employed to 
determine the two points that are used to construct the regression line). Thus, if a child consumes 
ten ounces of sugar per week, employing X = 10 in the regression equation, the predicted number 
of cavities for the child is Y' = 1.24 + (.30)(10) = 4.24. In using Figure 28.4 to predict 
the value of Y', we identify the point on the X-axis which corresponds to the subject's score on 
the X variable. A perpendicular line is erected from that point until it intersects the regression 
line. At the point the perpendicular line intersects the regression line, a second perpendicular line 
is dropped to the Y-axis. The point at which the latter perpendicular line intersects the Y-axis 
corresponds to the predicted value Y'. This procedure, which is illustrated in Figure 28.4, yields 
the same value Y' = 4.24, which is obtained when the regression equation is employed. 

Figure 28.4 Regression Line of Y on X for Example 28.1 
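
The derivation of the regression line of Y on X can be sketched in Python using only the summary values reported for Example 28.1 (r = .955, the two means, and the two standard deviations). The variable and function names are illustrative assumptions. The slope is rounded to two decimal places before the intercept is computed, to mirror the rounding used in the text.

```python
# Summary statistics reported in the text for Example 28.1.
r = 0.955
mean_x, mean_y = 7.2, 3.4
s_x, s_y = 8.58, 2.70

# Slope (Equation 28.6) and Y intercept (Equation 28.7), rounded to two
# decimal places as in the text.
b_y = round(r * (s_y / s_x), 2)
a_y = round(mean_y - b_y * mean_x, 2)

def predict_y(x):
    """Predicted number of cavities Y' for x ounces of sugar (Equation 28.4)."""
    return a_y + b_y * x

print(f"Y' = {a_y:.2f} + {b_y:.2f}X")
for x in (0, 5, 10):
    print(f"X = {x:>2}: Y' = {predict_y(x):.2f}")
```

The three X values are the ones used above: X = 0 and X = 5 give the two points used to construct the line, and X = 10 gives the prediction for a child who consumes ten ounces of sugar per week.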

The regression line of X on Y is determined with Equation 28.8. 


$$X' = a_X + b_X Y$$ (Equation 28.8)


Where: X' represents the predicted X score for a subject 
Y represents the subject's Y score which is used to predict the value X' 
a_X represents the X intercept, which is the point at which the regression line crosses 
the X-axis 
b_X represents the slope of the regression line of X on Y 


In order to derive Equation 28.8, the values of b_X and a_X must be computed. Either 
Equation 28.9 or Equation 28.10 can be employed to compute the value b_X. The latter equations 
are employed below to compute the value b_X = 3.03. 

$$b_X = \frac{SP_{XY}}{SS_Y} = \frac{\Sigma XY - \frac{(\Sigma X)(\Sigma Y)}{n}}{\Sigma Y^2 - \frac{(\Sigma Y)^2}{n}} = 3.03$$ (Equation 28.9)

$$b_X = r\left(\frac{s_X}{s_Y}\right) = (.955)\left(\frac{8.58}{2.70}\right) = 3.03$$ (Equation 28.10)

Equation 28.11 is employed to compute the value a_X. The latter equation is employed 
below to compute the value a_X = -3.10. 

$$a_X = \bar{X} - b_X\bar{Y} = 7.2 - (3.03)(3.4) = -3.10$$ (Equation 28.11)


Substituting the values a_X = -3.10 and b_X = 3.03 in Equation 28.8, we determine that the 
equation for the regression line of X on Y is X' = -3.10 + 3.03Y. Since two points can be used to 
construct a straight line, we can select two values for Y and substitute each value in Equation 
28.8, and solve for the values that would be predicted for X'. Each set of values that is comprised 
of a Y score and the resulting X' value will represent one point on the regression line. Thus, if 
we plot any two points derived in this manner and connect them, the resulting line is the 
regression line of X on Y. To demonstrate this, if the value Y = 0 is substituted in the regression 
equation, it yields the value X' = -3.10 (which equals the value of a_X): X' = -3.10 + (3.03)(0) 
= -3.10. Thus, the first point that will be employed in constructing the regression line is (-3.10, 
0). If we next substitute the value Y = 5 in the regression equation, it yields the value X' = 12.05: 
X' = -3.10 + (3.03)(5) = 12.05. Thus, the second point to be used in constructing the regression 
line is (12.05, 5). The regression line that results from connecting the points (-3.10, 0) and 
(12.05, 5) is displayed in Figure 28.5. Note that since the value Y = 0 results in a negative X 
value, the X-axis in Figure 28.5 must be extended to the left of the origin in order to 
accommodate the value X = -3.10. 

Figure 28.5 Regression Line of X on Y for Example 28.1 


If the researcher wants to predict a subject's score on the X variable by employing the 
subject's score on the Y variable, the predicted value X' can be derived either from the regression 
equation or from Figure 28.5. If the regression equation is employed, the value X' is derived by 
substituting a subject's Y score in the equation (which is the same procedure that is employed to 
determine the two points that are used to construct the regression line). Thus, if a child has four 
cavities, employing Y = 4 in the regression equation, the predicted number of ounces of sugar the 
child eats per week is X' = -3.10 + (3.03)(4) = 9.02. In using Figure 28.5 to predict the value 
of X', we identify the point on the Y-axis which corresponds to the subject's score on the Y 
variable. A perpendicular line is erected from that point until it intersects the regression line. 
At the point the perpendicular line intersects the regression line, a second perpendicular line is 
dropped to the X-axis. The point at which the latter perpendicular line intersects the X-axis 
corresponds to the predicted value X'. This procedure, which is illustrated in Figure 28.5, yields 
the same value X' = 9.02, which is obtained when the regression equation is employed. 
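
The regression line of X on Y can be sketched the same way from the summary values reported for Example 28.1. The names used are illustrative assumptions. The last lines of the sketch add one check that is not worked in the text: algebraically, the product of the two unrounded slopes, r(s_Y/s_X) and r(s_X/s_Y), equals r². Plotted on the same axes, the two lines coincide only when that product is 1, which is one way of seeing why the lines are identical only when |r| = 1.

```python
# Summary statistics reported in the text for Example 28.1.
r = 0.955
mean_x, mean_y = 7.2, 3.4
s_x, s_y = 8.58, 2.70

# Slope (Equation 28.10) and X intercept (Equation 28.11), rounded to two
# decimal places as in the text.
b_x = round(r * (s_x / s_y), 2)
a_x = round(mean_x - b_x * mean_y, 2)

def predict_x(y):
    """Predicted ounces of sugar X' for a child with y cavities (Equation 28.8)."""
    return a_x + b_x * y

print(f"X' = {a_x:.2f} + {b_x:.2f}Y")
for y in (0, 4, 5):
    print(f"Y = {y}: X' = {predict_x(y):.2f}")

# Product of the two unrounded slopes; algebraically this equals r squared.
slope_product = (r * s_y / s_x) * (r * s_x / s_y)
print(f"b_Y * b_X = {slope_product:.3f}, r squared = {r ** 2:.3f}")
```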

The protocol described in this section for deriving a regression equation does not provide 
any information regarding the accuracy of prediction that will result from such an equation. The 
standard error of estimate, which is discussed in the next section, is used as an index of accur- 
acy in regression analysis. The standard error of estimate is a function of a set of n deviation 
scores that are referred to as residuals. A residual is the difference between the predicted value 
of the criterion variable for a subject (i.e., Y' or X'), and a subject's actual score on the criterion 
variable (i.e., Y or X). A discussion of the role of residuals in regression analysis can be found 
in Section VII. 
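
As a small illustration of residuals, the sketch below applies the regression equation Y' = 1.24 + .30X to five pairs of scores. The five (X, Y) pairs are an assumption: they are taken to be the Example 28.1 scores, which are not listed in this section, and should be checked against the original data table (they do reproduce the means 7.2 and 3.4 reported above). Computing each residual as the actual score minus the predicted score is also an assumed sign convention.

```python
# Assumed raw scores for Example 28.1 (X = ounces of sugar, Y = cavities);
# verify against the original data table before relying on them.
xs = [20, 0, 1, 12, 3]
ys = [7, 0, 2, 5, 3]

# Regression line of Y on X derived in the text.
a_y, b_y = 1.24, 0.30

predicted = [a_y + b_y * x for x in xs]
# Residual = actual Y minus predicted Y' (sign convention assumed here).
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]
sum_sq = sum(e ** 2 for e in residuals)

for x, y, y_hat, e in zip(xs, ys, predicted, residuals):
    print(f"X = {x:>2}, Y = {y}, Y' = {y_hat:.2f}, residual = {e:+.2f}")
print(f"sum of residuals = {sum(residuals):.2f}")
print(f"sum of squared residuals = {sum_sq:.3f}")
```

Under these assumptions the residuals sum to zero (within rounding), and the sum of squared residuals is the quantity that the method of least squares minimizes.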

In closing the discussion of the derivation of a regression line, it is important to emphasize 
that in some instances where the value of r is equal to or close to zero, there may actually be a 
curvilinear relationship between the two variables. When the absolute value of r is such that 
there is a weak to moderate relationship between the variables, if, in fact, a curvilinear function 
best describes the relationship between the variables, it will provide a more accurate basis for 
prediction than will the straight line derived through use of the method of least squares. One 
advantage of constructing a scatterplot is that it allows a researcher to visually assess whether or 
not a curvilinear function is more appropriate than a straight line in describing the relationship 
between the variables. If the latter is true, the researcher should derive the equation for the 
appropriate curve. Although the derivation of equations for curvilinear functions will not be 
described in this book, it is discussed in many books on correlation and regression. 


2. The standard error of estimate The standard error of estimate is a standard deviation 
of the distribution of error scores employed in regression analysis. More specifically, it is an 
index of the difference between the predicted versus the actual value of the criterion variable. 
The standard error of estimate for the regression line of Y on X (which is represented by the 
notation $s_{Y.X}$) represents the standard deviation of the values of Y for a specific value of X. The 
standard error of estimate for the regression line of X on Y (which is represented by the notation 
$s_{X.Y}$) represents the standard deviation of the values of X for a specific value of Y. Thus, in 
Example 28.1, $s_{Y.X}$ represents the standard deviation for the number of cavities of any subject 
whose weekly sugar consumption is equal to a specific number of ounces. $s_{X.Y}$, on the other 
hand, represents the standard deviation for the number of ounces of sugar consumed by any 
subject who has a specific number of cavities. 

The standard error of estimate can be employed to compute a confidence interval for the 
predicted value of Y (or X). The larger the value of a standard error of estimate, the larger will 
be the range of values that define the confidence interval and, consequently, the less likely it is 
that the predicted value Y’ (or X’) will equal or be close to the actual score of a given subject on 
that variable. 

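Equations 28.12 and 28.13 are, respectively, employed to compute the values $s_{Y.X}$ and $s_{X.Y}$ 
(which are estimates of the underlying population parameters $\sigma_{Y.X}$ and $\sigma_{X.Y}$). 

$$ s_{Y.X} = \tilde{s}_Y \sqrt{\frac{n-1}{n-2}\left[1 - r^2\right]} \qquad \text{(Equation 28.12)} $$

$$ s_{X.Y} = \tilde{s}_X \sqrt{\frac{n-1}{n-2}\left[1 - r^2\right]} \qquad \text{(Equation 28.13)} $$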





As the size of the sample increases, the value (n - 1)/(n - 2) in the radical of Equations 
28.12 and 28.13 approaches 1, and thus for large sample sizes the equations simplify to 
$s_{Y.X} = \tilde{s}_Y\sqrt{1 - r^2}$ and $s_{X.Y} = \tilde{s}_X\sqrt{1 - r^2}$. Note, however, that for small sample sizes the latter 
equations underestimate the values of $s_{Y.X}$ and $s_{X.Y}$. 
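
The effect of the (n - 1)/(n - 2) term can be illustrated with a short Python sketch. It is offered only as an illustration: the function names are hypothetical, and the standard deviation is set to 1 (an assumed value, not one taken from Example 28.1) so that the printed numbers are simply the ratio of the corrected to the simplified standard error.

```python
import math


def se_estimate(sd_criterion, r, n):
    """Standard error of estimate with the (n - 1)/(n - 2) term (form of Equations 28.12/28.13)."""
    return sd_criterion * math.sqrt((n - 1) / (n - 2) * (1 - r ** 2))


def se_estimate_large_sample(sd_criterion, r):
    """Large-sample simplification that drops the (n - 1)/(n - 2) term."""
    return sd_criterion * math.sqrt(1 - r ** 2)


r = 0.955
sd = 1.0  # assumed unit standard deviation, so each output is a pure ratio
for n in (5, 30, 1000):
    ratio = se_estimate(sd, r, n) / se_estimate_large_sample(sd, r)
    print(n, round(ratio, 4))
```

For n = 5 the ratio is the square root of 4/3, so the simplified equation understates the standard error of estimate by roughly 13%; by n = 1000 the two versions are practically identical.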

Equations 28.12 and 28.13 are employed to compute the values $s_{Y.X} = .92$ and 
$s_{X.Y} = 2.94$. 

$$ s_{Y.X} = \tilde{s}_Y \sqrt{\frac{5-1}{5-2}\left[1 - (.955)^2\right]} = .92 $$

$$ s_{X.Y} = \tilde{s}_X \sqrt{\frac{5-1}{5-2}\left[1 - (.955)^2\right]} = 2.94 $$


3. Computation of a confidence interval for the value of the criterion variable It turns out 
that Equations 28.12 and 28.13 are not unbiased estimates of error throughout the full range of 
values the criterion variable may assume. What this translates into is that if a researcher wants 
to compute a confidence interval with respect to a specific subject's score on the criterion 
variable, in the interest of complete accuracy an adjusted standard error of estimate value should be 
employed. The adjusted standard error of estimate values will be designated $\tilde{s}_{Y.X}$ and $\tilde{s}_{X.Y}$. The 
values computed for $\tilde{s}_{Y.X}$ and $\tilde{s}_{X.Y}$ will always be larger than the values computed for $s_{Y.X}$ and 
$s_{X.Y}$. The larger the deviation between a subject's score on the predictor variable and the mean 
score for the predictor variable, the greater the difference between the values $s_{Y.X}$ versus $\tilde{s}_{Y.X}$ 
and $s_{X.Y}$ versus $\tilde{s}_{X.Y}$. Equations 28.14 and 28.15 are employed to compute the values $\tilde{s}_{Y.X}$ and 
$\tilde{s}_{X.Y}$. In the latter equations, the values X and Y, respectively, represent the X and Y scores of 
the specific subject for whom the standard error of estimate is computed. 

$$ \tilde{s}_{Y.X} = s_{Y.X} \sqrt{1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{(n-1)\tilde{s}_X^2}} \qquad \text{(Equation 28.14)} $$

$$ \tilde{s}_{X.Y} = s_{X.Y} \sqrt{1 + \frac{1}{n} + \frac{(Y - \bar{Y})^2}{(n-1)\tilde{s}_Y^2}} \qquad \text{(Equation 28.15)} $$


At this point two confidence intervals will be computed employing the values $\tilde{s}_{Y.X}$ and 
$\tilde{s}_{X.Y}$. The two confidence intervals will be in reference to the two subjects for whom the values 
Y' = 4.24 and X' = 9.02 are predicted in the previous section (employing Equations 28.4 and 
28.8). Initially, the use of Equation 28.14 will be demonstrated to compute a confidence interval 
for the subject who consumes 10 ounces of sugar (i.e., X = 10), and (through use of Equation 
28.4) is predicted to have Y' = 4.24 cavities. Equation 28.16 is employed to compute a 
confidence interval for the predicted value of Y. 


$$ CI_{(1-\alpha)} = Y' \pm (t_{\alpha/2})(\tilde{s}_{Y.X}) \qquad \text{(Equation 28.16)} $$

Where: $t_{\alpha/2}$ represents the tabled critical two-tailed value in the t distribution, for df = n - 2, 
below which a proportion (percentage) equal to $[1 - (\alpha/2)]$ of the cases falls. If the 
proportion (percentage) of the distribution that falls within the confidence interval is 
subtracted from 1 (100%), it will equal the value of $\alpha$. 


In the computation of a confidence interval, the predicted value Y' can be conceptualized 
as the mean value in a population of scores on the Y variable for a specific subject. When the 
sample size employed for the analysis is large (i.e., n > 100), one can assume that the shape of 
such a distribution for each subject will be normal and, in such a case, the relevant tabled critical 
two-tailed z value (i.e., $z_{\alpha/2}$) can be employed in Equation 28.16 in place of the relevant tabled 
critical t value. For smaller sample sizes (as is the case for Example 28.1), however, the t 
distribution provides a more accurate approximation of the underlying population distribution. Use 
of the normal distribution with small sample sizes underestimates the range of values that define 
a confidence interval. Inspection of Equation 28.16 reveals that the range of values computed 
for a confidence interval is a function of the magnitude of the standard error of estimate and the 
tabled critical t value (the magnitude of the latter being inversely related to the sample size). 

In order to use Equation 28.16, the value $\tilde{s}_{Y.X}$ must be computed with Equation 28.14. 
Employing Equation 28.14, the value $\tilde{s}_{Y.X} = 1.02$ is computed. Note that the latter value is slightly 
larger than the value $s_{Y.X} = .92$ computed with Equation 28.12. 


To demonstrate the use of Equation 28.16, the 95% confidence interval will be computed. 
The value $t_{.05} = 3.18$ is employed to represent $t_{\alpha/2}$, since in Table A2 it is the tabled critical 
two-tailed .05 t value for df = 3. The appropriate values are now substituted in Equation 28.16 
to compute the 95% confidence interval. 


$$ CI_{.95} = 4.24 \pm (3.18)(1.02) = 4.24 \pm 3.24 $$


This result indicates that the researcher can be 95% confident (or the probability is .95) that 
the number of cavities the subject actually has falls within the range 1.00 and 7.48 (i.e., 
1.00 ≤ Y ≤ 7.48). 
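
The two steps just described (adjusting the standard error for a specific subject, then forming the interval) can be sketched in Python. This is an illustrative sketch only: the function names are hypothetical, and `adjusted_se` assumes that Equation 28.14 has the usual form in which the squared distance of the subject's predictor score from the predictor mean is divided by (n - 1) times the predictor variance. The interval itself uses only values reported in the text (Y' = 4.24, the tabled t value 3.18, and the adjusted standard error 1.02).

```python
import math


def adjusted_se(se, n, score, predictor_mean, predictor_sd):
    """Adjusted standard error of estimate for one subject (assumed form of Equation 28.14)."""
    distance = (score - predictor_mean) ** 2 / ((n - 1) * predictor_sd ** 2)
    return se * math.sqrt(1 + 1 / n + distance)


def prediction_interval(predicted, t_crit, se_adj):
    """Equation 28.16: predicted value plus and minus (critical t)(adjusted standard error)."""
    margin = t_crit * se_adj
    return predicted - margin, predicted + margin


low, high = prediction_interval(predicted=4.24, t_crit=3.18, se_adj=1.02)
print(round(low, 2), round(high, 2))
```

The printed limits agree with the range 1.00 to 7.48 reported above. Note that `adjusted_se` is smallest when the subject's predictor score equals the predictor mean, and grows as the score moves away from the mean.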

A confidence interval will now be computed for the subject who has 4 cavities (i.e., Y = 4), 
and (through use of Equation 28.8) is predicted to eat 9.02 ounces of sugar. Equation 28.17 is 
employed to compute a confidence interval for the value of X. 


$$ CI_{(1-\alpha)} = X' \pm (t_{\alpha/2})(\tilde{s}_{X.Y}) \qquad \text{(Equation 28.17)} $$


In order to use Equation 28.17 the value $\tilde{s}_{X.Y}$ must be computed with Equation 28.15. 
Employing Equation 28.15, the value $\tilde{s}_{X.Y} = 3.24$ is computed. Note that the latter value is 
slightly larger than the value $s_{X.Y} = 2.94$ computed with Equation 28.13. 


$$ \tilde{s}_{X.Y} = 2.94 \sqrt{1 + \frac{1}{5} + \frac{(4 - \bar{Y})^2}{(5-1)\tilde{s}_Y^2}} = 3.24 $$


As is done in the previous example, the 95% confidence interval will be computed. Thus, 
the values $t_{.05} = 3.18$ and $\tilde{s}_{X.Y} = 3.24$ are substituted in Equation 28.17. 


$$ CI_{.95} = 9.02 \pm (3.18)(3.24) = 9.02 \pm 10.30 $$

This result indicates that the researcher can be 95% confident (or the probability is .95) that 
the number of ounces of sugar the subject actually eats falls within the range -1.28 and 19.32 
(i.e., -1.28 ≤ X ≤ 19.32). Since it is impossible to have a negative number of ounces 
of sugar, the result translates into between 0 and 19.32 ounces of sugar. 


4. Computation of a confidence interval for a Pearson product-moment correlation 
coefficient In order to compute a confidence interval for a computed value of the Pearson 
product-moment correlation coefficient, it is necessary to employ a procedure developed by Fisher 
(1921) referred to as Fisher's $z_r$ (or z) transformation. The latter procedure transforms an 
r value to a scale that is based on the normal distribution. The rationale behind the use of 
Fisher's $z_r$ transformation is that although the theoretical sampling distribution of the correlation 
coefficient can be approximated by the normal distribution when the value of a population 
correlation is equal to zero, as the value of the population correlation deviates from zero, the 
sampling distribution becomes more and more skewed. Thus, in computing confidence intervals 
(as well as in testing hypotheses involving one or more populations in which a hypothesized 
population correlation is some value other than zero), Fisher's $z_r$ transformation is required to 
transform a skewed sampling distribution into a normalized format. 

Equation 28.18 is employed to convert an r value into a Fisher transformed value, which 
is represented by the notation $z_r$. 








$$ z_r = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right) \qquad \text{(Equation 28.18)} $$


Where: ln represents the natural logarithm of a number (which is defined in Endnote 5 in 
the Introduction) 


Although logarithmic values can be computed with a function key on most scientific 
calculators, if one does not have access to a calculator, Table A17 (Table of Fisher's $z_r$ 
Transformation) in the Appendix provides an alternative way of deriving the Fisher 
transformed values. The latter table contains the $z_r$ values that correspond to specific values of 
r. The reader should take note of the fact that in employing Equation 28.18 or Table A17, the 
sign assigned to a $z_r$ value is always the same as the sign of the r value upon which it is based. 
Thus, a positive r value will always be associated with a positive $z_r$ value, and a negative r value 
will always be associated with a negative $z_r$ value. When r = 0, $z_r$ will also equal zero. 
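
As an alternative to the table lookup, the transformation and its inverse can be sketched in Python using only the standard library. The function names below are hypothetical; the inverse is obtained by solving Equation 28.18 for r, which is algebraically the hyperbolic tangent of $z_r$.

```python
import math


def fisher_z(r):
    """Equation 28.18: z_r = (1/2) ln[(1 + r) / (1 - r)]."""
    return 0.5 * math.log((1 + r) / (1 - r))


def inverse_fisher_z(z):
    """Equation 28.18 solved for r (equivalent to tanh(z))."""
    return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)


print(round(fisher_z(0.955), 3))                            # compare with the tabled value for r = .955
print(math.isclose(fisher_z(-0.955), -fisher_z(0.955)))     # the sign of z_r follows the sign of r
print(round(inverse_fisher_z(fisher_z(0.40)), 3))           # converting back recovers r
```

The first line reproduces the value 1.886 that is used for r = .955 in the analyses which follow.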

Equation 28.19 is employed to compute the confidence interval for a computed r value. 


$$ CI_{(1-\alpha)} = z_r \pm (z_{\alpha/2})\sqrt{\frac{1}{n - 3}} \qquad \text{(Equation 28.19)} $$


Where: $z_{\alpha/2}$ represents the tabled critical two-tailed value in the normal distribution below 
which a proportion (percentage) equal to $[1 - (\alpha/2)]$ of the cases fall. If the 
proportion (percentage) of the distribution that falls within the confidence interval is 
subtracted from 1 (100%), it will equal the value of $\alpha$. 


The value $\sqrt{1/(n - 3)}$ in Equation 28.19 represents the standard error of $z_r$. In employing 
Equation 28.19 to compute the 95% confidence interval, the product of the tabled critical 
two-tailed .05 z value and the standard error of $z_r$ is added to and subtracted from the Fisher 
transformed value for the computed r value. The two resulting values, which represent $z_r$ values, 
are then reconverted into correlation coefficients through use of Table A17 or by reconfiguring 
Equation 28.18 to solve for r. Use of Table A17 for the latter is accomplished by identifying 
the r values which correspond to the computed $z_r$ values. The resulting r values derived from 
the table identify the limits that define the 95% confidence interval. 

Equation 28.19 will now be used to compute the 95% confidence interval for r = .955. 
From Table A17 it is determined that the Fisher transformed value which corresponds to 
r = .955 is $z_r = 1.886$. The latter value can also be computed with Equation 28.18: 
$z_r = (1/2)\ln[(1 + .955)/(1 - .955)] = 1.886$. The appropriate values are now substituted in 
Equation 28.19. 


$$ CI_{.95} = 1.886 \pm (1.96)\sqrt{\frac{1}{5 - 3}} = 1.886 \pm 1.386 $$


Subtracting from and adding 1.386 to 1.886 yields the values .5 and 3.272. The latter 
values are now converted into r values through use of Table A17. By interpolating, we can 
determine that a $z_r$ value of .5 corresponds to the value r = .462, which will define the lower 
limit of the confidence interval. Since the value $z_r = 3.272$ is substantially above the $z_r$ value 
that corresponds to the largest tabled r value, it will be associated with the value r = 1. Thus, we 
can be 95% confident (or the probability is .95) that the true value of the population correlation 
falls between .462 and 1. Symbolically, this can be written as follows: $.462 \le \rho \le 1$. Note that 
because of the small sample size employed in the experiment, the range of values that define the 
confidence interval is quite large. 
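
The full procedure (transform r, add and subtract the margin, convert back) can be sketched in Python. The function name is hypothetical, and the sketch computes the critical z value and the inverse transformation directly rather than reading them from Tables A1 and A17, so small rounding differences from the tabled results are to be expected.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, confidence=0.95):
    """Confidence interval for a correlation via Fisher's transformation (Equation 28.19)."""
    z_r = math.atanh(r)                                        # same quantity as Equation 28.18
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # two-tailed critical z value
    margin = z_crit * math.sqrt(1 / (n - 3))                   # critical z times standard error of z_r
    return math.tanh(z_r - margin), math.tanh(z_r + margin)    # convert both limits back to r


low, high = correlation_ci(0.955, 5)
print(round(low, 3), round(high, 3))
```

The lower limit agrees with the interpolated value .462. The upper limit computes to just under 1 (about .997), which is the value the table-based procedure treats as r = 1.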

If the 99% confidence interval is computed, the tabled critical two-tailed .01 value 
$z_{.01} = 2.58$ is employed in Equation 28.19 in place of $z_{.05} = 1.96$. As is always the case in 
computing a confidence interval, the range of values that defines a 99% confidence interval will 
be larger than the range which defines a 95% confidence interval. 


5. Test 28b: Test for evaluating the hypothesis that the true population correlation is a 
specific value other than zero In certain instances, a researcher may want to evaluate whether 
an obtained correlation could have come from a population in which the true correlation between 
two variables is a specific value other than zero. The null and alternative hypotheses that are 
evaluated under such conditions are as follows. 


$$ H_0: \rho = \rho_0 $$


(In the underlying population the sample represents, the correlation between the scores of 
subjects on Variable X and Variable Y equals $\rho_0$.) 


$$ H_1: \rho \neq \rho_0 $$

(In the underlying population the sample represents, the correlation between the scores of 
subjects on Variable X and Variable Y equals some value other than $\rho_0$. The alternative 
hypothesis as stated is nondirectional, and is evaluated with a two-tailed test. It is also possible 
to state the alternative hypothesis directionally ($H_1: \rho > \rho_0$ or $H_1: \rho < \rho_0$), in which case 
it is evaluated with a one-tailed test.) 


Equation 28.20 is employed to evaluate the null hypothesis $H_0: \rho = \rho_0$. 


$$ z = \frac{z_r - z_{\rho_0}}{\sqrt{\frac{1}{n - 3}}} \qquad \text{(Equation 28.20)} $$


Where: $z_r$ represents the Fisher transformed value of the computed value of r 
$z_{\rho_0}$ represents the Fisher transformed value of $\rho_0$, the hypothesized population 
correlation 


Equation 28.20 will now be employed in reference to Example 28.1. Let us assume that we 
want to evaluate whether the true population correlation between the number of ounces of sugar 
consumed and the number of cavities is .80. Thus, the null hypothesis is $H_0: \rho = .80$, and the 
nondirectional alternative hypothesis is $H_1: \rho \neq .80$. 

By employing Table A17 (or Equation 28.18), we determine that the corresponding $z_r$ 
values for the obtained correlation coefficient r = .955 and the hypothesized population 
correlation coefficient $\rho = .80$ are, respectively, $z_r = 1.886$ and $z_{\rho_0} = 1.099$ (the notation $z_\rho$ is 
employed in place of $z_r$ whenever the relevant element in an equation identifies a population 
correlation). Substituting the Fisher transformed values in Equation 28.20, the value z = 1.11 is 
computed. 


$$ z = \frac{1.886 - 1.099}{\sqrt{\frac{1}{5 - 3}}} = 1.11 $$





The computed value z = 1.11 is evaluated with Table A1 (Table of the Normal 
Distribution) in the Appendix. In order to reject the null hypothesis, the obtained absolute value 
of z must be equal to or greater than the tabled critical two-tailed value at the prespecified level 
of significance. Since z = 1.11 is less than the tabled critical two-tailed values $z_{.05} = 1.96$ and 
$z_{.01} = 2.58$, the null hypothesis cannot be rejected at either the .05 or .01 level. Thus, the null 
hypothesis that the true population correlation equals .80 is retained. 

If the alternative hypothesis is stated directionally, in order to reject the null hypothesis the 
obtained absolute value of z must be equal to or greater than the tabled critical one-tailed value 
at the prespecified level of significance (i.e., $z_{.05} = 1.65$ or $z_{.01} = 2.33$). Since z = 1.11 is 
less than $z_{.05} = 1.65$, the directional alternative hypothesis $H_1: \rho > .80$ is not supported. 
Note that the sign of the value of z computed with Equation 28.20 will be positive when the 
computed value of r is greater than the hypothesized value $\rho_0$, and negative when the computed 
value of r is less than the hypothesized value $\rho_0$. Since z = 1.11 is a positive number (r = .955 
is greater than the hypothesized value .80), the directional alternative hypothesis $H_1: \rho < .80$ is 
inconsistent with the data, and is thus not supported. 
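
A minimal Python sketch of Test 28b follows. The function name is hypothetical, and the two-tailed probability is computed from the standard normal distribution instead of being read from Table A1; that probability is an addition for illustration and is not reported in the text.

```python
import math
from statistics import NormalDist


def z_test_for_rho(r, rho0, n):
    """Equation 28.20: compare a sample r with a hypothesized population correlation rho0."""
    z = (math.atanh(r) - math.atanh(rho0)) / math.sqrt(1 / (n - 3))
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


z, p = z_test_for_rho(r=0.955, rho0=0.80, n=5)
print(round(z, 2), p > 0.05)
```

The computed z rounds to the value 1.11 obtained above, and the two-tailed probability exceeds .05, which is consistent with retaining the null hypothesis.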


6. Computation of power for the Pearson product-moment correlation coefficient Prior 
to collecting correlational data, a researcher can determine the likelihood of detecting a 
population correlation of a specific magnitude if a specific value of n is employed. As a result 
of such a power analysis, one can determine the minimum sample size required to detect a 
prespecified population correlation. To illustrate the computation of power, let us assume that 
prior to collecting the data for 25 subjects a researcher wants to determine the power associated 
with the analysis if the value of the population correlation he wants to detect is $\rho = .40$. It will 
be assumed that a nondirectional analysis is conducted, with $\alpha = .05$. 

Equation 28.21 (which is described in Guenther (1965, pp. 244-246)) is employed to 
compute the power of the analysis. 


$$ \delta = |z_{\rho_0} - z_{\rho_1}|\sqrt{n - 3} \qquad \text{(Equation 28.21)} $$


Where: $z_{\rho_0}$ is the Fisher transformed value of the population correlation stipulated in the 
null hypothesis, and $z_{\rho_1}$ is the Fisher transformed value of the population correlation 
the researcher wants to detect 


Table A17 in the Appendix reveals that the Fisher transformed value associated with 
$\rho = 0$ is $z_{\rho_0} = 0$, and thus when the null hypothesis $H_0: \rho = 0$ is employed (which it will 
be assumed is the case), Equation 28.21 reduces to $\delta = z_{\rho_1}\sqrt{n - 3}$. 
Employing Table A17, we determine that the Fisher transformed value for the population 
correlation of $\rho = .40$ is $z_{\rho_1} = .424$. Substituting the appropriate values in Equation 28.21, the 
value $\delta = 1.99$ is computed. 


$$ \delta = |0 - .424|\sqrt{25 - 3} = 1.99 $$

The obtained value $\delta = 1.99$ is evaluated with Table A3 (Power Curves for Student's t 
Distribution) in the Appendix. A full discussion on the use of Table A3 (which is employed 
to evaluate the power of a number of different types of t tests) can be found in Section VI of the 
single-sample t test (Test 2). Employing the power curve for df = ∞ in Table A3-C (the 
appropriate table for a nondirectional/two-tailed analysis, with $\alpha = .05$), we determine the power 
of the correlational analysis to be approximately .52. Thus, if the underlying population 
correlation is $\rho = .40$ and a sample size of n = 25 is employed, the likelihood of the researcher 
rejecting the null hypothesis is only .52. If this value is deemed too small, the researcher can 
substitute larger values of n in Equation 28.21 until a value is computed for $\delta$ that is associated 
with an acceptable level of power. 

Equation 28.21 can also be employed if the value stated in the null hypothesis is some value 
other than $\rho = 0$. Assume that a number of studies suggest that the population correlation 
between two variables is $\rho = .60$. A researcher, who has reason to believe that the latter value 
may overestimate the true population correlation, wants to compute the power of a correlational 
analysis to determine if the true population correlation is, in fact, $\rho = .40$. In this example (for 
which the value n = 25 will be employed) $H_0: \rho = .60$. Since the researcher believes the true 
population correlation may be less than .60, the alternative hypothesis is stated directionally. 
Thus, $H_1: \rho < .60$. 

Employing Table A17, we determine the Fisher transformed value for $\rho = .60$ is $z_{\rho_0} = .693$ 
and, from the previous analysis, we know that for $\rho = .40$, $z_{\rho_1} = .424$. Substituting the 
appropriate values in Equation 28.21, the value $\delta = 1.26$ is computed. 


$$ \delta = |.693 - .424|\sqrt{25 - 3} = 1.26 $$


Employing the power curve for df = ∞ in Table A3-D (i.e., the curves for the one-tailed 
.05 value), we determine the power of the analysis to be approximately .37. Thus, if the 
underlying population correlation is $\rho = .40$ and a sample size of n = 25 is employed, the 
likelihood of the researcher rejecting the null hypothesis $H_0: \rho = .60$ is only .37. 
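
The computation of $\delta$, and a rough stand-in for the power curves, can be sketched in Python. The function names are hypothetical. Note that `approx_power` is not the Table A3 lookup described above: it is a normal-curve approximation substituted here for illustration, so its results can differ from the graphically read values by a few hundredths.

```python
import math
from statistics import NormalDist


def delta(rho0, rho1, n):
    """Equation 28.21: distance between the two Fisher transformed values times sqrt(n - 3)."""
    return abs(math.atanh(rho0) - math.atanh(rho1)) * math.sqrt(n - 3)


def approx_power(d, alpha=0.05, two_tailed=True):
    """Normal-curve approximation to power (an assumption, not the Table A3 curves)."""
    norm = NormalDist()
    if two_tailed:
        z_crit = norm.inv_cdf(1 - alpha / 2)
        return norm.cdf(d - z_crit) + norm.cdf(-d - z_crit)
    return norm.cdf(d - norm.inv_cdf(1 - alpha))


def smallest_n(rho0, rho1, target_power=0.80, alpha=0.05, two_tailed=True):
    """Increase n until the approximate power reaches the target."""
    n = 4
    while approx_power(delta(rho0, rho1, n), alpha, two_tailed) < target_power:
        n += 1
    return n


d_two = delta(0.0, 0.40, 25)
d_one = delta(0.60, 0.40, 25)
print(round(d_two, 2), round(approx_power(d_two), 2))
print(round(d_one, 2), round(approx_power(d_one, two_tailed=False), 2))
print(smallest_n(0.0, 0.40))
```

The two $\delta$ values reproduce 1.99 and 1.26. The approximate power values (about .51 and .35) fall close to the .52 and .37 read from the power curves. The last line carries out the search suggested in the text, substituting larger values of n until the power reaches a chosen target (here .80, an arbitrary choice for the example).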

Cohen (1977; 1988, Chapter 3) has derived tables that allow a researcher to determine the 
appropriate sample size to employ if one wants to evaluate an alternative hypothesis which 
designates a specific value for a population correlation (when the null hypothesis is $H_0: \rho = 0$). 
These tables can be employed as an alternative to the procedure described in this section in 
computing power for the Pearson product-moment correlation coefficient. 


7. Test 28c: Test for evaluating a hypothesis on whether there is a significant difference 
between two independent correlations There are occasions when a researcher will compute 
a correlation between the same two variables for two independent samples. In the event the 
correlation coefficients obtained for the two samples are not equal, the researcher may wish to 
determine whether the difference between the two correlations is statistically significant. The 
null and alternative hypotheses that are evaluated under such conditions are as follows. 


$$ H_0: \rho_1 = \rho_2 $$

(In the underlying populations represented by the two samples, the correlation between the two 
variables is equal.) 

$$ H_1: \rho_1 \neq \rho_2 $$


(In the underlying populations represented by the two samples, the correlation between the two 
variables is not equal. The alternative hypothesis as stated is nondirectional, and is evaluated 
with a two-tailed test. It is also possible to state the alternative hypothesis directionally 
($H_1: \rho_1 > \rho_2$ or $H_1: \rho_1 < \rho_2$), in which case it is evaluated with a one-tailed test.) 


To illustrate, let us assume that in Example 28.1 the correlation of r = .955 between the 
number of ounces of sugar eaten per week and the number of cavities is based on a sample of five 
ten-year-old boys (to be designated Sample 1). Let us also assume that the researcher evaluates 
a sample of five ten-year-old girls (to be designated Sample 2), and determines that in this second 
sample the correlation between the number of ounces of sugar eaten per week and the number 
of cavities is r = .765. Equation 28.22 can be employed to determine whether or not the 
difference between $r_1 = .955$ and $r_2 = .765$ is significant. 


$$ z = \frac{z_{r_1} - z_{r_2}}{\sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}} \qquad \text{(Equation 28.22)} $$


Where: $z_{r_1}$ represents the Fisher transformed value of the computed value of $r_1$ for Sample 1 
$z_{r_2}$ represents the Fisher transformed value of the computed value of $r_2$ for Sample 2 
$n_1$ and $n_2$ are, respectively, the number of subjects in Sample 1 and Sample 2 


Since there are five subjects in both samples, $n_1 = n_2 = 5$. From the analysis in the 
previous section we already know that the Fisher transformed value of $r_1 = .955$ is 
$z_{r_1} = 1.886$. For the female sample, employing Table A17 we determine that the Fisher 
transformed value of $r_2 = .765$ is $z_{r_2} = 1.008$. When the appropriate values are substituted 
in Equation 28.22, they yield the value z = .878. 


$$ z = \frac{1.886 - 1.008}{\sqrt{\frac{1}{5 - 3} + \frac{1}{5 - 3}}} = .878 $$











The value z = .878 is evaluated with Table A1. In order to reject the null hypothesis, the 
obtained absolute value of z must be equal to or greater than the tabled critical two-tailed value 
at the prespecified level of significance. Since z = .878 is less than the tabled critical two-tailed 
values $z_{.05} = 1.96$ and $z_{.01} = 2.58$, the nondirectional alternative hypothesis $H_1: \rho_1 \neq \rho_2$ is 
not supported at either the .05 or .01 level. Thus, we retain the null hypothesis that there is an 
equal correlation between the two variables in each of the populations represented by the samples. 

If the alternative hypothesis is stated directionally, in order to reject the null hypothesis the 
obtained absolute value of z must be equal to or greater than the tabled critical one-tailed value 
at the prespecified level of significance (i.e., $z_{.05} = 1.65$ or $z_{.01} = 2.33$). The sign of z will 
be positive when $r_1 > r_2$, and thus can only support the alternative hypothesis $H_1: \rho_1 > \rho_2$. 
The sign of z will be negative when $r_1 < r_2$, and thus can only support the alternative 
hypothesis $H_1: \rho_1 < \rho_2$. Since z = .878 is less than $z_{.05} = 1.65$, the directional alternative 
hypothesis $H_1: \rho_1 > \rho_2$ is not supported. 
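
Test 28c can be sketched in Python as follows. The function name is hypothetical, and the two-tailed probability (computed from the standard normal distribution) is an illustrative addition that is not reported in the text.

```python
import math
from statistics import NormalDist


def compare_independent_correlations(r1, n1, r2, n2):
    """Equation 28.22: z test for the difference between two independent correlations."""
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / standard_error
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


z, p = compare_independent_correlations(0.955, 5, 0.765, 5)
print(round(z, 3), p > 0.05)
```

With the boys' and girls' correlations, the computed z agrees with the value .878 obtained above; reversing the order of the two samples changes only the sign of z.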

Edwards (1984) notes that when the null hypothesis is retained, since the analysis suggests 
that the two samples represent a single population, Equation 28.23 can be employed to provide 
a weighted estimate of the common population correlation. 

$$ \bar{z}_r = \frac{(n_1 - 3)z_{r_1} + (n_2 - 3)z_{r_2}}{(n_1 - 3) + (n_2 - 3)} \qquad \text{(Equation 28.23)} $$


Substituting the data in Equation 28.23, the Fisher transformed value $\bar{z}_r = 1.447$ is 
computed. 


$$ \bar{z}_r = \frac{(5 - 3)(1.886) + (5 - 3)(1.008)}{(5 - 3) + (5 - 3)} = 1.447 $$

Employing Table A17, we determine that the Fisher transformed value $\bar{z}_r = 1.447$ 
corresponds to the value r = .895. Thus, r = .895 can be employed as the best estimate of the 
common population correlation. Note that the estimated common population correlation 
computed with Equation 28.23 is not the same value that is obtained if, instead, one calculates 
the weighted average of the two correlations (which, since the sample sizes are equal, is the 
average of the two correlations: (.955 + .765)/2 = .86). The fact that the weighted average of 
the two correlations yields a different value from the result obtained with Equation 28.23 can be 
attributed to the fact that the theoretical sampling distribution of the correlation coefficient becomes 
more skewed as the absolute value of r approaches 1. 
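
The contrast between the two estimates can be sketched in Python. The function name is hypothetical; it accepts any number of samples, although only the two samples discussed above are used here, and it converts the averaged value back to r by computation instead of Table A17.

```python
import math


def pooled_correlation(rs, ns):
    """Equation 28.23: weight each Fisher transformed value by (n - 3), then convert back to r."""
    weights = [n - 3 for n in ns]
    z_bar = sum(w * math.atanh(r) for w, r in zip(weights, rs)) / sum(weights)
    return z_bar, math.tanh(z_bar)


z_bar, r_common = pooled_correlation([0.955, 0.765], [5, 5])
simple_average = (0.955 + 0.765) / 2
print(round(z_bar, 3), round(r_common, 3), round(simple_average, 2))
```

The sketch reproduces the values 1.447 and .895, and shows that the estimate based on the transformed values is larger than the simple average of .86.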

Cohen (1977; 1988, Ch. 4) has developed a statistic referred to as the q index that can be 
employed for computing the power of the test comparing two independent correlation 
coefficients. A brief discussion of the q index can be found in Section IX (the Addendum) 
under the discussion of meta-analysis and related topics. 


8. Test 28d: Test for evaluating a hypothesis on whether k independent correlations are 
homogeneous Test 28c can be extended to determine whether more than two independent 
correlation coefficients are homogeneous (in other words, can be viewed as representing the 
same population correlation, p). The null and alternative hypotheses that are evaluated under 
such conditions are as follows. 


$$ H_0: \rho_1 = \rho_2 = \cdots = \rho_k $$


(In the underlying populations represented by the k samples, the correlation between the two 
variables is equal.) 


$$ H_1: \text{Not } H_0 $$


(In the underlying populations represented by the k samples, the correlation between the two 
variables is not equal in at least two of the populations. The alternative hypothesis as stated is 
nondirectional.) 


To illustrate, let us assume that the correlation between the number of ounces of sugar eaten 
per week and the number of cavities is computed for three independent samples, each sample 
consisting of five children living in different parts of the country. The values of the correlations 
obtained for the three samples are as follows: $r_1 = .955$, $r_2 = .765$, $r_3 = .845$. 

Equation 28.24 is employed to determine whether the k = 3 sample correlations are 
homogeneous. (Equation 28.89 in Section IX (the Addendum) is a different but equivalent 
version of Equation 28.24.) In Equation 28.24, wherever the summation sign $\sum_{j=1}^{k}$ appears it 
indicates that the operation following the summation sign is carried out for each of the k = 3 
samples, and the resulting k = 3 values are summed. 

$$ \chi^2 = \sum_{j=1}^{k}\left[(n_j - 3)z_{r_j}^2\right] - \frac{\left[\sum_{j=1}^{k}(n_j - 3)z_{r_j}\right]^2}{\sum_{j=1}^{k}(n_j - 3)} \qquad \text{(Equation 28.24)} $$


Since there are five subjects in each sample, $n_1 = n_2 = n_3 = 5$. From the analysis in the 
previous section, we already know that the Fisher transformed values of $r_1 = .955$ and 
$r_2 = .765$ are, respectively, $z_{r_1} = 1.886$ and $z_{r_2} = 1.008$. Employing Table A17, we determine 
for Sample 3 the Fisher transformed value of $r_3 = .845$ is $z_{r_3} = 1.238$. When the appropriate 
values are substituted in Equation 28.24, they yield the value $\chi^2 = .83$. 


$$ \chi^2 = \left[(5 - 3)(1.886)^2 + (5 - 3)(1.008)^2 + (5 - 3)(1.238)^2\right] - \frac{\left[(5 - 3)(1.886) + (5 - 3)(1.008) + (5 - 3)(1.238)\right]^2}{(5 - 3) + (5 - 3) + (5 - 3)} = .83 $$


The value $\chi^2 = .83$ is evaluated with Table A4 (Table of the Chi-Square Distribution) 
in the Appendix. The degrees of freedom employed in evaluating the obtained chi-square 
value are df = k - 1. Thus, for the above example, df = 3 - 1 = 2. In order to reject the null 
hypothesis, the obtained value of $\chi^2$ must be equal to or greater than the tabled critical value 
at the prespecified level of significance. Since $\chi^2 = .83$ is less than $\chi^2_{.05} = 5.99$ and 
$\chi^2_{.01} = 9.21$ (which are the tabled critical .05 and .01 values for df = 2 when a nondirectional 
alternative hypothesis is employed), the null hypothesis cannot be rejected at either the .05 or 
.01 level. Thus, we retain the null hypothesis that in the underlying populations represented by 
the k = 3 samples, the correlations between the two variables are equal. 
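
The homogeneity test can be sketched in Python as follows. The function name is hypothetical, and the decision is made by comparing the statistic with the tabled critical value 5.99 cited above (passed in as a plain number) instead of computing a probability.

```python
import math


def homogeneity_chi_square(rs, ns):
    """Equation 28.24: chi-square test of whether k independent correlations are homogeneous."""
    weights = [n - 3 for n in ns]
    zs = [math.atanh(r) for r in rs]                        # Fisher transformed values
    sum_wz2 = sum(w * z ** 2 for w, z in zip(weights, zs))
    sum_wz = sum(w * z for w, z in zip(weights, zs))
    chi_square = sum_wz2 - sum_wz ** 2 / sum(weights)
    return chi_square, len(rs) - 1                          # statistic and df = k - 1


chi_square, df = homogeneity_chi_square([0.955, 0.765, 0.845], [5, 5, 5])
critical_05 = 5.99  # tabled critical .05 value for df = 2, as cited in the text
print(round(chi_square, 2), df, chi_square >= critical_05)
```

The statistic rounds to .83 with df = 2, and since it falls below the critical value the comparison on the last line is false, in agreement with the decision to retain the null hypothesis.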

Edwards (1984) notes that when the null hypothesis is retained, since the analysis suggests 
that the k samples represent a single population, Equation 28.25 can be employed to provide a 
weighted estimate of the common population correlation. 


$$ \bar{z}_r = \frac{\sum_{j=1}^{k}(n_j - 3)z_{r_j}}{\sum_{j=1}^{k}(n_j - 3)} \qquad \text{(Equation 28.25)} $$


Substituting the data in Equation 28.25, the Fisher transformed value $\bar{z}_r = 1.377$ is 
computed. 


$$ \bar{z}_r = \frac{(5 - 3)(1.886) + (5 - 3)(1.008) + (5 - 3)(1.238)}{(5 - 3) + (5 - 3) + (5 - 3)} = 1.377 $$

Employing Table A17, we determine that the Fisher transformed value $\bar{z}_r = 1.377$ 
corresponds to the value r = .88. Thus, r = .88 can be employed as the best estimate of the common 
population correlation. Note that, as is the case when the same analysis is conducted for k = 2 
samples, the value obtained for the common population correlation (using Equation 28.25) is not 
the same as the value that is obtained if the weighted average of the three correlation coefficients 
is computed (i.e., (.955 + .765 + .845)/3 = .855). 


9. Test 28e: Test for evaluating the null hypothesis $H_0: \rho_{XZ} = \rho_{YZ}$ There are instances when 
a researcher may want to evaluate if, within a specific population, one variable (X) has the same 
correlation with some criterion variable (Z) as does another variable (Y). The null and alternative 
hypotheses which are evaluated in such a situation are as follows. 


$$ H_0: \rho_{XZ} = \rho_{YZ} $$

(In the underlying population represented by the sample, the correlation between variables X and 
Z is equal to the correlation between variables Y and Z.) 


$$ H_1: \rho_{XZ} \neq \rho_{YZ} $$

(In the underlying population represented by the sample, the correlation between variables X and 
Z is not equal to the correlation between variables Y and Z. The alternative hypothesis as stated 
is nondirectional, and is evaluated with a two-tailed test. It is also possible to state the 
alternative hypothesis directionally ($H_1: \rho_{XZ} > \rho_{YZ}$ or $H_1: \rho_{XZ} < \rho_{YZ}$), in which case it is 
evaluated with a one-tailed test.) 


To illustrate how one can evaluate the null hypothesis $H_0: \rho_{XZ} = \rho_{YZ}$, let us assume that 
the correlation between the number of ounces of sugar eaten per week and the number of cavities 
is computed for five subjects, and r = .955. Let us also assume that for the same five subjects 
we determine that the correlation between the number of ounces of salt eaten per week and the 
number of cavities is r = .52. We want to determine whether there is a significant difference 
in the correlation between the number of ounces of sugar eaten per week and the number of 
cavities versus the number of ounces of salt eaten per week and the number of cavities. Let us 
also assume that for the sample of five subjects, the correlation between the number of ounces 
of sugar eaten per week and the number of ounces of salt eaten per week is r = .37. 

In the above example, within the framework of the hypothesis being evaluated we have two 
predictor variables: the number of ounces of sugar eaten per week and the number of ounces 
of salt eaten per week. These two predictor variables will, respectively, represent the X and Y 
variables in the analysis to be described. The number of cavities, which is the criterion variable, 
will be designated as the Z variable. Thus, $r_{XZ} = .955$, $r_{YZ} = .52$, $r_{XY} = .37$. 

The test statistic for evaluating the null hypothesis, which is based on the t distribution, is 
computed with Equation 28.26. A more detailed description of the test statistic can be found in 
Steiger (1980), who notes that Equation 28.26 provides a superior test of the hypothesis being 
evaluated when compared with an alternative procedure developed by Hotelling (1940) (which 
is described in Lindeman et al. (1980)). 


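$$t = (r_{YZ} - r_{XZ}) \sqrt{\frac{(n - 1)(1 + r_{XY})}{2\left(\frac{n - 1}{n - 3}\right)\left(1 - r_{YZ}^2 - r_{XZ}^2 - r_{XY}^2 + 2 r_{YZ} r_{XZ} r_{XY}\right) + \frac{(r_{YZ} + r_{XZ})^2}{4}(1 - r_{XY})^3}}$$
(Equation 28.26) 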


Substituting the values $n = 5$, $r_{XZ} = .955$, $r_{YZ} = .52$, and $r_{XY} = .37$ in Equation 28.26, 
the value $t = -1.78$ is computed. 

$$t = (.52 - .955) \sqrt{\frac{(5 - 1)(1 + .37)}{2\left(\frac{5 - 1}{5 - 3}\right)\left[1 - (.52)^2 - (.955)^2 - (.37)^2 + 2(.52)(.955)(.37)\right] + \frac{(.52 + .955)^2}{4}(1 - .37)^3}} = -1.78$$


The value $t = -1.78$ is evaluated with Table A2. The degrees of freedom employed in 
evaluating the obtained t value are $df = n - 3$. Thus, for the above example, $df = 5 - 3 = 2$. 
In order to reject the null hypothesis, the obtained absolute value of t must be equal to or greater 
than the tabled critical value at the prespecified level of significance. Since the absolute value 
$|t| = 1.78$ is less than $t_{.05} = 4.30$ and $t_{.01} = 9.93$ (which are the tabled critical two-tailed values 
for $df = 2$), the null hypothesis cannot be rejected at either the .05 or .01 level. Thus, we retain 
the null hypothesis that the population correlation between variables X and Z is equal to the 
population correlation between variables Y and Z. 

If the alternative hypothesis is stated directionally, in order to reject the null hypothesis, the 
obtained absolute value of t must be equal to or greater than the tabled critical one-tailed value 
at the prespecified level of significance (which for $df = 2$ are $t_{.05} = 2.92$ and $t_{.01} = 6.97$). The 
sign of t must be positive if $H_1: \rho_{XZ} < \rho_{YZ}$ and must be negative if $H_1: \rho_{XZ} > \rho_{YZ}$. Since the 
absolute value $|t| = 1.78$ is less than $t_{.05} = 2.92$, the directional alternative hypothesis 
$H_1: \rho_{XZ} > \rho_{YZ}$ is not supported. 
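
The computation for Test 28e can be checked with a few lines of code. The following is a minimal sketch, assuming Equation 28.26 has the form t = (r_YZ - r_XZ) multiplied by a square-root term, which is the form that reproduces the worked value above. The function name `dependent_corr_t` and its argument names are illustrative and do not come from the text.

```python
import math


def dependent_corr_t(r_xz, r_yz, r_xy, n):
    """t statistic (df = n - 3) for H0: rho_XZ = rho_YZ, all r values from one sample."""
    # Determinant-like term built from the three correlations.
    det_r = 1 - r_yz**2 - r_xz**2 - r_xy**2 + 2 * r_yz * r_xz * r_xy
    # Square of the mean of the two correlations being contrasted.
    mean_r_sq = (r_yz + r_xz) ** 2 / 4
    denominator = 2 * ((n - 1) / (n - 3)) * det_r + mean_r_sq * (1 - r_xy) ** 3
    return (r_yz - r_xz) * math.sqrt((n - 1) * (1 + r_xy) / denominator)


# Sugar (X), salt (Y), and cavities (Z) for the five subjects in the example.
t = dependent_corr_t(r_xz=0.955, r_yz=0.52, r_xy=0.37, n=5)
df = 5 - 3
print(round(t, 2), df)
```

The resulting t value is then compared with the tabled critical value for df = n - 3, exactly as described above; the sketch does not look up critical values.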

In the event the t value obtained with Equation 28.26 is significant, it indicates that the 
predictor variable which correlates highest with the criterion variable (i.e., the one with the highest 
absolute value) is the best predictor of subjects' scores on the latter variable. It should be 
noted that because the analysis discussed in this section represents a dependent samples analysis 
(since all three correlations are based on the same sample), Equation 28.22 (the equation for 
contrasting two independent correlations) is not appropriate to use to evaluate the null hypothesis 
$H_0: \rho_{XZ} = \rho_{YZ}$. 


10. Tests for evaluating a hypothesis regarding one or more regression coefficients A 
number of tests have been developed that evaluate hypotheses concerning the slope of a regression 
line (which is also referred to as a regression coefficient). This section will present a brief 
description of such tests. In the statement of the null and alternative hypotheses of tests concerning 
a regression coefficient, the notation $\beta$ is employed to represent the slope of the line in 
the underlying population represented by a sample. Thus, $\beta_Y$ is the population regression 
coefficient of the regression line of Y on X, and $\beta_X$ is the population regression coefficient of the 
regression line of X on Y. 


Test 28f: Test for evaluating the null hypothesis $H_0: \beta = 0$ A test of significance can be 
conducted to evaluate the hypothesis of whether, in the underlying population, the value of the 
slope of a regression line is equal to zero. The null hypotheses that can be evaluated in reference 
to the two regression lines are $H_0: \beta_Y = 0$ and $H_0: \beta_X = 0$. In point of fact, the test of the 
generic null hypothesis $H_0: \beta = 0$ will always yield the same result as that obtained when the 
null hypothesis $H_0: \rho = 0$ is evaluated using Test 28a (which employs Equation 28.3). This 
is the case, since whenever $\rho = 0$, the slope of a regression line in the underlying population will 
also equal zero. Equations 28.27 and 28.28 are, respectively, employed to evaluate the null 
hypotheses $H_0: \beta_Y = 0$ and $H_0: \beta_X = 0$. The equations are employed below with the data for 
Example 28.1 and, in both instances, yield the same t value as that obtained when the null 
hypothesis $H_0: \rho = 0$ is evaluated with Equation 28.3. The slight discrepancies between the 
t values computed with Equations 28.3, 28.27, and 28.28 are the result of rounding off error. 

$$t = \frac{b_Y}{s_{Y.X} / (\tilde{s}_X \sqrt{n - 1})} = \frac{.30}{.92 / (8.58 \sqrt{5 - 1})} = 5.60$$
(Equation 28.27) 

$$t = \frac{b_X}{s_{X.Y} / (\tilde{s}_Y \sqrt{n - 1})} = \frac{3.03}{2.94 / (2.70 \sqrt{5 - 1})} = 5.57$$
(Equation 28.28) 

The t values computed with Equations 28.27 and 28.28 are evaluated with Table A2. The 
degrees of freedom employed in evaluating the obtained t values are $df = n - 2$. Since both t 
values are, within rounding error, identical to the t value computed with Equation 28.3 and the same 
degrees of freedom are employed, interpretation of the t values leads to the same conclusions, except 
that in this case the conclusions are in reference to the regression coefficients $\beta_Y$ and $\beta_X$. Thus, the 
nondirectional alternative hypotheses $H_1: \beta_Y \neq 0$ and $H_1: \beta_X \neq 0$ are supported at the .05 level, 
since (for $df = 3$) the computed t values are greater than the tabled critical two-tailed value 
$t_{.05} = 3.18$. The directional alternative hypotheses $H_1: \beta_Y > 0$ and $H_1: \beta_X > 0$ are 
supported at both the .05 and .01 levels, since the sign of t (as well as the sign of each of the 
regression coefficients) is positive, and the computed t values are greater than the tabled critical 
one-tailed values $t_{.05} = 2.35$ and $t_{.01} = 4.54$. 

Equations 28.29 and 28.30 can, respectively, be employed to compute confidence intervals 
for the regression coefficients $\beta_Y$ and $\beta_X$. The computation of the 95% confidence interval for 
the two regression coefficients is demonstrated below. The value $t_{.05} = 3.18$ (which is also 
employed in computing the confidence intervals derived with Equations 28.16 and 28.17) is 
employed in Equations 28.29 and 28.30 to represent $t_{\alpha/2}$, since in Table A2 it is the tabled 
critical two-tailed .05 t value for $df = 3$ (which is computed with $df = n - 2$). 








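$$CI_{\beta_Y} = b_Y \pm (t_{\alpha/2}) \left[\frac{s_{Y.X}}{\tilde{s}_X \sqrt{n - 1}}\right] = .30 \pm (3.18) \left[\frac{.92}{8.58 \sqrt{5 - 1}}\right] = .30 \pm .17$$
(Equation 28.29) 

$$CI_{\beta_X} = b_X \pm (t_{\alpha/2}) \left[\frac{s_{X.Y}}{\tilde{s}_Y \sqrt{n - 1}}\right] = 3.03 \pm (3.18) \left[\frac{2.94}{2.70 \sqrt{5 - 1}}\right] = 3.03 \pm 1.73$$
(Equation 28.30) 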














The above results indicate the following: a) There is a 95% likelihood that the population 
regression coefficient $\beta_Y$ falls within the range .13 and .47 (i.e., $.13 \leq \beta_Y \leq .47$); and 
b) There is a 95% likelihood that the population regression coefficient $\beta_X$ falls within the range 
1.30 and 4.76 (i.e., $1.30 \leq \beta_X \leq 4.76$). Since the nondirectional alternative hypotheses 
$H_1: \beta_Y \neq 0$ and $H_1: \beta_X \neq 0$ are supported at the .05 level, it logically follows that the value 
zero will not fall within the range that defines either of the confidence intervals. 
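
The t values and confidence intervals for the two regression coefficients can be reproduced numerically. The sketch below assumes the standard error of a slope is the standard error of estimate divided by the product of the predictor's estimated population standard deviation and the square root of (n - 1), which is the form that reproduces the values reported above. The function names are illustrative only, and the critical value 3.18 is passed in by hand rather than looked up.

```python
import math


def slope_standard_error(s_est, s_pred, n):
    """Standard error of a slope: s_est / (s_pred * sqrt(n - 1))."""
    return s_est / (s_pred * math.sqrt(n - 1))


def slope_t(b, s_est, s_pred, n):
    """t statistic (df = n - 2) for H0: beta = 0."""
    return b / slope_standard_error(s_est, s_pred, n)


def slope_ci(b, s_est, s_pred, n, t_crit):
    """Confidence interval b +/- t_crit * (standard error of the slope)."""
    half_width = t_crit * slope_standard_error(s_est, s_pred, n)
    return b - half_width, b + half_width


# Regression of Y on X: b_Y = .30, s_Y.X = .92, estimated sd of X = 8.58.
t_y = slope_t(0.30, 0.92, 8.58, 5)
ci_y = slope_ci(0.30, 0.92, 8.58, 5, t_crit=3.18)

# Regression of X on Y: b_X = 3.03, s_X.Y = 2.94, estimated sd of Y = 2.70.
t_x = slope_t(3.03, 2.94, 2.70, 5)
ci_x = slope_ci(3.03, 2.94, 2.70, 5, t_crit=3.18)

print(round(t_y, 2), [round(v, 2) for v in ci_y])
print(round(t_x, 2), [round(v, 2) for v in ci_x])
```

Because both intervals exclude zero, the sketch agrees with the conclusion that the nondirectional alternative hypotheses are supported at the .05 level.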


Test 28g: Test for evaluating the null hypothesis $H_0: \beta_1 = \beta_2$ A test of significance can be 
conducted to evaluate whether the slopes of two regression lines obtained from two independent 
samples are equal to one another. As is the case with Test 28c, it is assumed that the correlations 
for the independent samples are for the same two variables. The null hypotheses evaluated by 
the test for the regression lines of Y on X and the regression lines of X on Y are, respectively, 
$H_0: \beta_{Y_1} = \beta_{Y_2}$ and $H_0: \beta_{X_1} = \beta_{X_2}$ (where $\beta_{Y_i}$ represents the slope of the regression line of Y on 
X in the underlying population represented by Sample i, and $\beta_{X_i}$ represents the slope of the 
regression line of X on Y in the underlying population represented by Sample i). As a result of 
evaluating the null hypothesis in reference to two independent regression lines of Y on X, a 
researcher can determine if the degree of change on the Y variable when the X variable is 
incremented by one unit is equivalent in the two samples. In the case of two independent regression 
lines of X on Y, a researcher can determine if the degree of change on the X variable when 
the Y variable is incremented by one unit is equivalent in the two samples. It should be noted that 
the test employed to evaluate the generic null hypothesis $H_0: \beta_1 = \beta_2$ is not equivalent to Test 
28c, which evaluates the null hypothesis $H_0: \rho_1 = \rho_2$. This is the case since (as is illustrated 
in Figure 28.2) it is entirely possible for the regression lines associated with two independent 
correlations to have identical slopes, yet be associated with dramatically different correlations. 

Equations 28.31 and 28.32 are employed to evaluate the null hypotheses $H_0: \beta_{Y_1} = \beta_{Y_2}$ 
and $H_0: \beta_{X_1} = \beta_{X_2}$. The t values computed with Equations 28.31 and 28.32 are evaluated with 
Table A2. The degrees of freedom employed in evaluating the obtained t values are 
$df = (n_1 - 2) + (n_2 - 2) = n_1 + n_2 - 4$. 


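$$t = \frac{b_{Y_1} - b_{Y_2}}{\sqrt{\frac{s_{Y.X_1}^2}{\tilde{s}_{X_1}^2 (n_1 - 1)} + \frac{s_{Y.X_2}^2}{\tilde{s}_{X_2}^2 (n_2 - 1)}}}$$
(Equation 28.31) 

$$t = \frac{b_{X_1} - b_{X_2}}{\sqrt{\frac{s_{X.Y_1}^2}{\tilde{s}_{Y_1}^2 (n_1 - 1)} + \frac{s_{X.Y_2}^2}{\tilde{s}_{Y_2}^2 (n_2 - 1)}}}$$
(Equation 28.32) 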


Equations 28.31 and 28.32 can be employed to evaluate the regression coefficients associated 
with the two independent correlations described within the framework of the example 
employed to demonstrate Test 28c. If, for instance, the regression coefficient $b_{Y_1}$ computed for 
boys (who will represent Sample 1) is larger than the regression coefficient $b_{Y_2}$ computed for 
girls (who will represent Sample 2), Equation 28.31 can be employed to evaluate the null 
hypothesis $H_0: \beta_{Y_1} = \beta_{Y_2}$. Although the full analysis will not be done here, the following values 
would be used in Equation 28.31 for Sample 1/boys (whose data are the same as those employed 
in Example 28.1): $n_1 = 5$, $b_{Y_1} = .30$, $\tilde{s}_{X_1}^2 = (8.58)^2 = 73.62$, $s_{Y.X_1}^2 = (.92)^2 = .85$. If upon 
substituting the analogous values for a sample of five girls in Equation 28.31 the resulting t value 
is significant, the null hypothesis $H_0: \beta_{Y_1} = \beta_{Y_2}$ is rejected. The number of degrees of freedom 
employed for the analysis is $df = 5 + 5 - 4 = 6$. The tabled critical .05 and .01 two-tailed 
and one-tailed t values for $df = 6$ are, respectively, $t_{.05} = 2.45$ and $t_{.01} = 3.71$, and 
$t_{.05} = 1.94$ and $t_{.01} = 3.14$. If the nondirectional alternative hypothesis $H_1: \beta_{Y_1} \neq \beta_{Y_2}$ is 
employed, in order to be significant the obtained absolute value of t must be equal to or greater 
than the tabled critical two-tailed value at the prespecified level of significance. If the directional 
alternative hypothesis $H_1: \beta_{Y_1} > \beta_{Y_2}$ is employed, in order to be significant the computed t 
value must be a positive number that is equal to or greater than the tabled critical one-tailed value 
at the prespecified level of significance. If the directional alternative hypothesis $H_1: \beta_{Y_1} < \beta_{Y_2}$ 
is employed, in order to be significant, the computed t value must be a negative number that has 
an absolute value which is equal to or greater than the tabled critical one-tailed value at the 
prespecified level of significance. 
11. Additional correlational procedures At the conclusion of the discussion of the Pearson 
product-moment correlation coefficient, an Addendum (Section IX) has been included which 
provides a description of the following additional correlational procedures that are directly or 
indirectly related to the Pearson product-moment correlation coefficient: a) Test 28h: The 
point-biserial correlation coefficient; b) Test 28i: The biserial correlation coefficient; c) 
Test 28j: The tetrachoric correlation coefficient; d) Test 28k: The multiple correlation 
coefficient; e) Test 28l: The partial correlation coefficient; f) Test 28m: The semi-partial 
correlation coefficient. The Addendum also contains additional material that is relevant to the 
general subject of correlation. 


VII. Additional Discussion of the Pearson Product-Moment 
Correlation Coefficient 


1. The definitional equation for the Pearson product-moment correlation coefficient 
Although more computationally tedious than Equation 28.1, Equation 28.33 is a conceptually 
more meaningful equation for computing the Pearson product-moment correlation coefficient. 
Unlike Equation 28.1, which allows for the quick computation of r, Equation 28.33 reveals the 
fact that Pearson conceptualized the product-moment correlation coefficient as the average of the 
products of the paired z scores of subjects on the X and Y variables. 


$$r = \frac{\sum_{i=1}^{n} z_{X_i} z_{Y_i}}{n - 1}$$
(Equation 28.33) 

Where: $z_{X_i} = (X_i - \bar{X})/\tilde{s}_X$ and $z_{Y_i} = (Y_i - \bar{Y})/\tilde{s}_Y$, with $X_i$ and $Y_i$ representing the scores of 
the $i^{th}$ subject on the X and Y variables 


As noted above, the correlation coefficient is, in actuality, the mean of the product of each 
subject's X and Y scores, when the latter are expressed as z scores. Since the computed r value 
represents an average score, many books employ n as the denominator of Equation 28.33 instead 
of $(n - 1)$. In point of fact, n can be employed as the denominator of Equation 28.33 if, in 
computing the $z_{X_i}$ and $z_{Y_i}$ scores, the sample standard deviations $s_X$ and $s_Y$ (computed with 
Equation I.7) are employed in place of the estimated population standard deviations $\tilde{s}_X$ and 
$\tilde{s}_Y$ (computed with Equation I.8). When the estimated population standard deviations are 
employed, however, $(n - 1)$ is the appropriate value to employ in the denominator of Equation 
28.33. 

In employing Equation 28.33 to compute the value of r, initially the mean and estimated 
population standard deviation of the X and Y scores must be computed. Each X score is then 
converted into a z score by employing the equation for converting a raw score into a z score 
(i.e., $z_{X_i} = (X_i - \bar{X})/\tilde{s}_X$). Each Y score is also converted into a z score using the same equation 
with reference to the Y variable (i.e., $z_{Y_i} = (Y_i - \bar{Y})/\tilde{s}_Y$). The product of each subject's $z_{X_i}$ 
and $z_{Y_i}$ scores is obtained, and the sum of the products for the n subjects is computed. The latter 
sum is divided by $(n - 1)$, yielding the value of r, which represents an average of the 
products. The value of r computed with Equation 28.33 will be identical to that computed 
with Equation 28.1. 

The computation of the value $r = .955$ with Equation 28.33 is demonstrated in Table 28.2. 
In deriving the $z_{X_i}$ and $z_{Y_i}$ scores, the following summary values are employed: $\bar{X} = 7.2$, 
$\tilde{s}_X = 8.58$, $\bar{Y} = 3.4$, $\tilde{s}_Y = 2.70$. Equation 28.33 yields the value $r = 3.818/4 = .955$ when 
the sum of the products in the last column of Table 28.2 is divided by $(n - 1)$. 
Table 28.2 Computation of r with Equation 28.33 

  X_i    z_Xi = (X_i - X̄)/s̃_X    Y_i    z_Yi = (Y_i - Ȳ)/s̃_Y    z_Xi z_Yi 

   20            1.49              7            1.33              1.982 
    0            -.84              0           -1.26              1.058 
    1            -.72              2            -.52               .374 
   12             .56              5             .59               .330 
    3            -.49              3            -.15               .074 

                                                  Σ z_Xi z_Yi =   3.818 
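
The z-score route to r summarized in Table 28.2 can be sketched in code as follows. The data are the five X and Y scores shown in the table; the function name `pearson_r_from_z` is illustrative only. The sketch uses the estimated population standard deviations (divisor n - 1) and therefore divides the sum of the z-score products by (n - 1), as the text specifies.

```python
import math


def pearson_r_from_z(x, y):
    """Pearson r as the sum of paired z-score products divided by (n - 1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Estimated population standard deviations (divisor n - 1).
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    z_products = [
        ((xi - mean_x) / sd_x) * ((yi - mean_y) / sd_y) for xi, yi in zip(x, y)
    ]
    return sum(z_products) / (n - 1)


x = [20, 0, 1, 12, 3]
y = [7, 0, 2, 5, 3]
r = pearson_r_from_z(x, y)
print(round(r, 3))
```

If the sample standard deviations (divisor n) were used instead, dividing the sum of the products by n would give the same value of r.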


2. Residuals In Section VI it is noted that a residual is the difference between the predicted 
value of a subject's score on the criterion variable and the subject's actual score on the criterion 
variable. Thus, a residual indicates the amount of error between a subject's actual and predicted 
scores. If $e_i$ represents the amount of error for the $i^{th}$ subject, the residual for a subject can be 
defined as follows (assuming the regression line of Y on X is employed): $e_i = (Y_i - Y_i')$. In 
the least squares regression model, the sum of the residuals will always equal zero. Thus, 
$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (Y_i - Y_i') = 0$. Since the sum of the residuals equals zero, the average of the 
residuals will also equal zero (i.e., $\bar{e} = \sum_{i=1}^{n} e_i / n = 0$). The latter reflects the fact that for 
some of the subjects the predicted value $Y_i'$ will be larger than $Y_i$, while for other subjects the 
predicted value $Y_i'$ will be smaller than $Y_i$ (of course, for some subjects $Y_i'$ may equal $Y_i$). 
It should be noted that if the sum of the squared distances of the data points from the regression 
line is not the minimum possible value, the sum of the residuals will generally be some value 
other than zero. 

The value $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - Y_i')^2$ (which is the sum of the squared residuals) provides an index 
of the accuracy of prediction that results from use of the regression equation. When the sum of 
the squared residuals is small, prediction will be accurate but when it is large, prediction will be 
inaccurate. When the sum of the squared residuals equals zero (which will only be the case when 
|r| = 1), prediction will be perfect. The latter statement, however, only applies to the scores of 
subjects in the sample employed in the study. It does not ensure that prediction will be perfect 
for other members of the underlying population the sample represents. The accuracy of 
prediction for the population will depend upon the degree to which the derived regression 
equation is an accurate estimate of the actual regression equation in the underlying population. 
The partitioning of the variation on the criterion variable in the least squares regression 
model can be summarized by Equation 28.34. 


$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i' - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - Y_i')^2$$
(Equation 28.34) 

Total variation = Explained variation + Error variation 


Note that in Equation 28.34, the error (unexplained) variation is the sum of the squared 
residuals. When $|r| = 1$, $\sum_{i=1}^{n} (Y_i - Y_i')^2 = 0$, which as noted earlier results in perfect 
prediction. When, on the other hand, $r = 0$, $\sum_{i=1}^{n} (Y_i' - \bar{Y})^2 = 0$, and thus the value $\bar{Y}$ will be 
the predicted value of $Y'$ for each subject (since, using Equations 28.6 and 28.7, if $r = 0$, $b_Y = 0$ 
and $a_Y = \bar{Y}$. If $a_Y = \bar{Y}$, the value of $Y'$ computed with Equation 28.4 is $Y' = \bar{Y}$). 

Through use of the residuals, the coefficient of determination can be expressed with 
Equation 28.35. 
(Equation 28.35) 


Equation 28.36 (which is the square root of Equation 28.35) represents an alternative (albeit 
more tedious) way of computing the correlation coefficient. 


(Equation 28.36) 





The value $s_{Y.X}^2$ is the residual variance (i.e., the variance of the residuals). The residual 
variance (which is the square of the value $s_{Y.X}$ computed with Equation 28.12) can be defined 
by Equation 28.37. The denominator of Equation 28.37 represents the degrees of freedom 
employed in the analysis. 


$$s_{Y.X}^2 = \frac{\sum_{i=1}^{n} (Y_i - Y_i')^2}{n - 2}$$
(Equation 28.37) 

Equation 28.38, which is the square root of Equation 28.37, is an alternative (albeit more 
tedious) way of computing the standard error of estimate. Inspection of Equation 28.38 
reveals that the greater the sum of the squared residuals, the greater the value of $s_{Y.X}$. 


$$s_{Y.X} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - Y_i')^2}{n - 2}}$$
(Equation 28.38) 





Everything that has been said about the residuals with reference to the regression line of 
Y on X can be generalized to the regression line of X on Y (in which case the residual for each 
subject is represented by $e_i = (X_i - X_i')$). Thus, all of the equations described in this section 
can be generalized to the second regression line by respectively employing the values $X_i$, $X_i'$, 
and $\bar{X}$ in place of $Y_i$, $Y_i'$, and $\bar{Y}$. 
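
The properties of the residuals described in this section can be verified numerically. The sketch below fits the least squares regression line of Y on X to the five pairs of scores from Example 28.1 (as listed in Table 28.2), then checks that the residuals sum to zero and that total variation equals explained variation plus error variation. The variable names are illustrative. The final line uses the standard identity that the coefficient of determination equals one minus the ratio of error variation to total variation, which is an assumption here and may differ in form from Equation 28.35.

```python
import math

x = [20, 0, 1, 12, 3]
y = [7, 0, 2, 5, 3]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope and intercept for the regression line of Y on X.
ss_x = sum((xi - mean_x) ** 2 for xi in x)
sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b_y = sp_xy / ss_x
a_y = mean_y - b_y * mean_x

predicted = [a_y + b_y * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

sum_residuals = sum(residuals)                              # should be zero
ss_total = sum((yi - mean_y) ** 2 for yi in y)              # total variation
ss_explained = sum((pi - mean_y) ** 2 for pi in predicted)  # explained variation
ss_error = sum(e ** 2 for e in residuals)                   # error variation

residual_variance = ss_error / (n - 2)     # Equation 28.37
s_est = math.sqrt(residual_variance)       # Equation 28.38
r_squared = 1 - ss_error / ss_total        # assumed identity, see lead-in

print(round(sum_residuals, 10), round(s_est, 3), round(math.sqrt(r_squared), 3))
```

With unrounded intermediate values the standard error of estimate comes out near .93, slightly above the .92 used in the text; the difference is presumably due to rounding in the hand computation.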


3. Covariance In Section IV it is noted that the numerator of Equation 28.1 is referred to as 
the sum of products. When the sum of products is divided by (n - 1), the resulting value 
represents a measure that is referred to as the covariance. Equation 28.39 is the computational 
equation for the covariance. 


$$cov_{XY} = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{n - 1}$$
(Equation 28.39) 


Equation 28.40 is the definitional equation of the covariance, which reveals the fact that 
covariance is an index of the degree to which two variables covary (i.e., vary in relation to one 
another). 

$$cov_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
(Equation 28.40) 

Each subject's contribution to the covariance is computed as follows: The difference between 
a subject's score on the X variable and the mean of the X variable, and the difference 
between a subject's score on the Y variable and the mean of the Y variable are computed. The 
two resulting deviation scores are multiplied together. The resulting product represents that 
subject's contribution to the covariance. Upon obtaining a product for all n subjects, the sum of 
the n products (which is the numerator of Equation 28.1) is divided by $(n - 1)$. The resulting 
value represents the covariance, which is essentially the average of the products 
of the deviation scores. The sum of the products is divided by $(n - 1)$ instead of 
n because (as is also the case in computing the variance) division by the latter value provides 
a biased estimate of the population covariance. In the event one is computing a covariance for 
a sample and not using it as an estimate of the underlying population covariance, n is employed 
as the denominator of Equations 28.39 and 28.40. 

Inspection of Equation 28.40 reveals that subjects who are above the mean on both 
variables or below the mean on both variables will contribute a positive product to the 
covariance. On the other hand, subjects who are above the mean on one of the variables but 
below the mean on the other variable will contribute a negative product to the covariance. If all 
or most of the subjects contribute positive products, the covariance will be a positive number. 
Since the value of r is a direct function of the sign of the covariance (which is a function of the 
sum of products), the resulting correlation coefficient will be a positive number. If all or most 
of the subjects contribute negative products, the covariance will be a negative number, and, 
consequently, the resulting correlation coefficient will also be negative. When among the n 
subjects the distribution of negative and positive products is such that they sum to zero, the sum 
of products will equal zero resulting in zero covariance, and r will equal zero. If, for one of the 
two variables, all subjects obtain the identical score, each subject will yield a product of zero, 
resulting in zero covariance (since the sum of products will equal zero). However, as noted in 
Section VI, since the value for the sum of squares will equal zero for a variable on which all 
subjects have the same score, Equation 28.1 becomes insoluble when all subjects have the same 
score on either of the variables. Based on what has been said with respect to the relationship 
between the sum of products and the covariance, the computation of r can be summarized by 
either of the following equations: $r = cov_{XY}/(\tilde{s}_X \tilde{s}_Y)$ and $r = SP_{XY}/\sqrt{SS_X SS_Y}$. 
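
The relationship between the sum of products, the covariance, and r can be illustrated with the data from Example 28.1 (the five pairs of scores listed in Table 28.2). The sketch below computes the covariance from the deviation scores with (n - 1) in the denominator, and then obtains r in the two ways just described. The variable names are illustrative only.

```python
import math

x = [20, 0, 1, 12, 3]
y = [7, 0, 2, 5, 3]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Each subject's contribution: product of the two deviation scores.
products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
sp_xy = sum(products)          # sum of products
cov_xy = sp_xy / (n - 1)       # covariance (estimate of the population value)

ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
sd_x = math.sqrt(ss_x / (n - 1))   # estimated population standard deviations
sd_y = math.sqrt(ss_y / (n - 1))

r_from_cov = cov_xy / (sd_x * sd_y)
r_from_sp = sp_xy / math.sqrt(ss_x * ss_y)

print(round(cov_xy, 2), round(r_from_cov, 3), round(r_from_sp, 3))
```

Both expressions give the same value of r because the (n - 1) terms in the covariance and in the two standard deviations cancel.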


4. The homoscedasticity assumption of the Pearson product-moment correlation coefficient 
It is noted in Section I that one of the assumptions underlying the Pearson product-moment 
correlation coefficient is a condition referred to as homoscedasticity (homo means same and 
scedastic means scatter). Homoscedasticity exists in a set of data if the relationship between the 
X and Y variables is of equal strength across the whole range of both variables. Data that are not 
homoscedastic are heteroscedastic. When data are homoscedastic the accuracy of a prediction 
based on the regression line will be consistent across the full range of both variables. To illus- 
trate, if data are homoscedastic and a strong positive correlation is computed between X and Y, 
the strong positive correlation will exist across all values of both variables. However, if for high 
values of X the correlation between X and Y is a strong positive one, but the strength of this 
relationship decreases as the value of X decreases, the data are heteroscedastic. As a general 
rule, if the distribution of one or both of the variables employed in a correlation is saliently 
skewed, the data are likely to be heteroscedastic. When, however, the data for both variables are 
distributed normally, the data will be homoscedastic. 
Figure 28.6 presents two regression lines (which it will be assumed represent the regression 
line of Y on X) and the accompanying data points. Note that in Figure 28.6a, which represents 
homoscedastic data, the distance of the data points from the regression line is about the same 
along the entire length of the line. Figure 28.6b, on the other hand, represents heteroscedastic 
data, since the data are not dispersed evenly along the regression line. Specifically, in Figure 
28.6b the data points are close to the line for high values of X, yet as the value of X decreases, 
the data points become further removed from the line. Thus, the strength of the positive 
correlation is much greater for high values of X than it is for low values. This translates into 
the fact that a subject's Y score can be predicted with a greater degree of accuracy if the subject 
has a high score on the X variable as opposed to a low score. Directly related to this is the fact 
that the value of the standard error of estimate computed with Equation 28.12 ($s_{Y.X}$) will not be 
a representative measure of error variability for all values of X. Specifically, when $\tilde{s}_{Y.X}$ 
computed with Equation 28.14 (which is a function of the value $s_{Y.X}$) is employed to compute a 
confidence interval through use of Equation 28.16, the value of $\tilde{s}_{Y.X}$ will be larger for subjects 
who have a low score on the X variable, and thus the confidence interval associated with the 
predicted scores of such subjects will be larger than the confidence interval for subjects who have 
a high score on the X variable. 





(a) Homoscedastic data (b) Heteroscedastic data 
Figure 28.6 Homoscedastic Versus Heteroscedastic Data 


5. The phi coefficient as a special case of the Pearson product-moment correlation coef- 
ficient A number of the correlational procedures discussed in this book represent special cases 
of the Pearson product-moment correlation coefficient. One of the procedures, the phi coef- 
ficient, is described in Section VI of the chi-square test for r x c tables (Test 16). Another 
of the procedures, the point-biserial correlation coefficient, is described in Section IX (the 
Addendum). A third procedure, Spearman's rank-order correlation coefficient, is discussed 
in the next chapter. 

In this section it will be demonstrated how the phi coefficient ($\phi$) can be computed with 
Equation 28.1. In the discussion of the latter measure of association, it is noted that the value of 
phi is equivalent to the value of the Pearson product-moment correlation coefficient that will 
be obtained if the scores 0 and 1 are employed with reference to two dichotomous variables in 
a 2 x 2 contingency table. Using the data for Examples 16.1/16.2 (which employ a 2 x 2 con- 
tingency table), the scores 0 and 1 are employed for each of the categories on the two variables. 
Table 28.3 summarizes the data. 

Table 28.3 Summary of Data for Examples 16.1/16.2 

                           X variable 
                          0          1        Row sums 
Y variable      0      a = 30     b = 70        100 
                1      c = 60     d = 40        100 
Column sums              90        110       Total = 200 

Table 28.3 reveals the following: 30 subjects have both an X score and a Y score of 0; 70 
subjects have an X score of 1 and a Y score of 0; 60 subjects have an X score of 0 and a Y score 
of 1; 40 subjects have both an X score and a Y score of 1. Employing this information we can 
determine $\sum X = 110$, $\sum X^2 = 110$, $\sum Y = 100$, $\sum Y^2 = 100$, $\sum XY = 40$. Substituting these 
values in Equation 28.1, the value $r = -.30$ is computed (the sign is negative because, with this 
coding, the sum of products is $40 - (110)(100)/200 = -15$). The absolute value of r is identical to 
$\phi = .30$ computed for Examples 16.1/16.2 with Equations 16.17 and 16.18. 
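
The equivalence between phi and the Pearson product-moment correlation coefficient can be checked by expanding the four cell frequencies of Table 28.3 into 200 pairs of 0/1 scores and applying the computational form of r. The sketch below is illustrative only; the dictionary `cells` and the other names are not from the text. Note that with this particular 0/1 coding the signed value of r is negative, and it is its absolute value that equals .30.

```python
import math

# Cell frequencies keyed by (X score, Y score), taken from Table 28.3.
cells = {(0, 0): 30, (1, 0): 70, (0, 1): 60, (1, 1): 40}

# Expand the frequencies into one (X, Y) pair per subject.
x, y = [], []
for (x_score, y_score), count in cells.items():
    x.extend([x_score] * count)
    y.extend([y_score] * count)

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Computational form of the Pearson r (sum of products over root of SS product).
sp_xy = sum_xy - sum_x * sum_y / n
ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
r = sp_xy / math.sqrt(ss_x * ss_y)

print(n, sum_x, sum_y, sum_xy, round(r, 2))
```

Reversing the 0 and 1 codes on one of the two variables would change the sign of r but not its absolute value.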
6. Autocorrelation/serial correlation In Section IX (the Addendum) of the single-sample 
runs test (Test 10) a number of procedures are discussed that are employed in determining 
whether the ordering of a series of numbers is random. Among the procedures that are briefly 
discussed is autocorrelation, which is also known as serial correlation. In contrast to most of 
the tests of randomness discussed in Section IX of the single-sample runs test (which can only 
be employed with a discrete variable) autocorrelation can be employed to evaluate either a 
continuous or discrete variable for randomness. 

The most basic methodology that can be employed for autocorrelation is to pair each of the 
numbers in a series of n numbers with the number that follows it in the series. Upon doing this, 
the Pearson product-moment correlation coefficient between the resulting (n - 1) pairs of 
numbers is computed. It is also possible to pair each number with the number whose ordinal 
position is some value other than one digit after it. In other words, each number can be paired 
with the number that is two digits after it, three digits after it, etc., or, with the number that is one, 
two, three, etc. digits before it in the series. In autocorrelation the number of digits that separate 
two values that are paired with one another is referred to as the lag value. In the example to be 
employed in this section the lag +1 will be used, since each number will be paired with the 
number that is one ordinal position above it in the series. If, instead, each number is paired with 
the number that falls two ordinal positions above it in the series, the lag value is +2. If, on the 
other hand, each number is paired with the number that precedes it by one ordinal position in the 
series, the lag value is - 1. The higher the absolute value of the lag value, the fewer the number 
of pairs that will be employed in computing the correlation. Thus, if in a series of ten digits each 
number is paired with the number that is above it by two ordinal positions, there will only be 
n - 2 = 8 pairs of X and Y scores. This is the case, since the first two numbers in the series can 
only be X scores, and the last two numbers in the series can only be Y scores. Regardless of the 
lag value employed in an autocorrelation, if the sequence of numbers in a series is random, the 
computed value of the correlation coefficient should equal zero. 

One variant of the methodology described in this section (which is referred to as non- 
circular serial correlation) is a procedure referred to as circular serial correlation. In circular 
serial correlation every number in a series of n numbers is paired with another number, including 
any numbers in the series that do not have a number following it. Numbers that are not followed 
by any numbers are sequentially paired with the numbers at the beginning of the series. Thus, if 
the lag value is +1, the last number in the series is paired with the first number in the series. If 
the lag value is +2, the $(n - 1)^{th}$ number is paired with the first number in the series, and the $n^{th}$ 
number is paired with the second number in the series. 

To illustrate autocorrelation, the following ten digit series of numbers will be evaluated: 
4, 3, 5, 2, 1, 3, 2, 1, 1, 2. In Table 28.4 the ten digits are arranged sequentially from top to 
bottom in Column A. The same ten digits are arranged sequentially in Column B, except for the 
fact that they are arranged so that each digit in Column B is adjacent to the digit in Column A 
that directly precedes it in the series. If each pair of adjacent values is treated as a set of scores, 
the value in Column A can be designated as an X score, and the value in Column B can be 
designated as a Y score. If the latter is done, each of the ten digits in the series will at some point 
be designated as both an X score and a Y score, except for the first digit which will only be an 
X score and the last digit which will only be a Y score. 


Table 28.4 Arrangement of Numbers for Autocorrelation 


Column A    Column B

                4
    4           3
    3           5
    5           2
    2           1
    1           3
    3           2
    2           1
    1           1
    1           2
    2


Table 28.5, which contains the nine pairs of digits in Table 28.4, summarizes the required 
values for computing the Pearson product-moment correlation coefficient. Note that the value 
n = 9 is employed in computing the value of r, since that is the number of sets of paired scores. 

Employing Equation 28.1, the value r = .28 is computed for the correlation coefficient. 


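$$ r = \frac{\Sigma XY - \frac{(\Sigma X)(\Sigma Y)}{n}}{\sqrt{\left[\Sigma X^2 - \frac{(\Sigma X)^2}{n}\right]\left[\Sigma Y^2 - \frac{(\Sigma Y)^2}{n}\right]}} = \frac{53 - \frac{(22)(20)}{9}}{\sqrt{\left[70 - \frac{(22)^2}{9}\right]\left[58 - \frac{(20)^2}{9}\right]}} = .28 $$
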

Table 28.5 Data for Autocorrelation 


  X      X²      Y      Y²      XY
  4      16      3       9      12
  3       9      5      25      15
  5      25      2       4      10
  2       4      1       1       2
  1       1      3       9       3
  3       9      2       4       6
  2       4      1       1       2
  1       1      1       1       1
  1       1      2       4       2

ΣX = 22   ΣX² = 70   ΣY = 20   ΣY² = 58   ΣXY = 53
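
The lag +1 pairing and the raw-score computation of r described above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function names `pearson_r` and `lagged_pairs` are introduced here and do not appear in the source, and only positive lag values are handled.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r computed from raw-score sums (the form of Equation 28.1)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

def lagged_pairs(series, lag=1, circular=False):
    """Pair each number with the number `lag` positions after it (lag > 0)."""
    n = len(series)
    if circular:
        # Numbers with no follower wrap around to the start of the series.
        x = list(series)
        y = [series[(i + lag) % n] for i in range(n)]
    else:
        # Noncircular: only n - lag pairs can be formed.
        x = list(series[:n - lag])
        y = list(series[lag:])
    return x, y

series = [4, 3, 5, 2, 1, 3, 2, 1, 1, 2]
x, y = lagged_pairs(series, lag=1)
r_lag1 = pearson_r(x, y)
print(len(x), round(r_lag1, 2))  # nine pairs; r rounds to .28
```

Passing `circular=True` pairs all ten numbers, with the last number wrapping around to the first, which corresponds to the circular serial correlation described earlier.
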

If the usual criteria for evaluating an r value are employed, the degrees of freedom for the 
analysis are df = n - 2 = 9 - 2 = 7. In Table A16, the tabled critical two-tailed .05 and .01 
values for df = 7 that are employed to evaluate the nondirectional alternative hypothesis 
$H_1: \rho \neq 0$ are $r_{.05} = .666$ and $r_{.01} = .798$. Since the computed value r = .28 is less than 
$r_{.05} = .666$, the null hypothesis $H_0: \rho = 0$ cannot be rejected. Thus, the data do not indicate that 
the underlying population correlation is some value other than zero. 

In point of fact, the tabled critical values in Table A16 are not the most appropriate values 
for evaluating the value r = .28. This is the case, since the sampling distribution for a serial 
correlation coefficient is not identical to the sampling distribution upon which the values in 
Table A16 are based. The sampling distribution upon which Table A16 is based assumes that 
the n pairs of scores are independent of one another. Since in Table 28.5 all of the digits in the 
series (with the exception of the last digit) represent both an X and a Y variable, the latter 
assumption is violated. Because the pairs are not independent, the residuals derived from the 
data may also not be independent (independence of the residuals is an underlying assumption of 
the least squares regression model). Although not necessarily the case with pseudorandom 
numbers (i.e., a series of random numbers generated with a computer algorithm), it is common 
in autocorrelated data in business and economics for residuals to be dependent on one another. 
Most commonly, in the latter disciplines there is a positive correlation between residuals. When 
the latter is true, residuals of the same sign occur in clusters — i.e., residuals for adjacent pairs 
have the identical sign. When there is a negative autocorrelation, adjacent residuals tend to 
alternate between a positive and negative sign. 

Because of the fact that the residuals may not be independent, a sampling distribution other 
than the one upon which the critical values in Table A16 are based should be employed to 
evaluate the value r = .28 computed with Equation 28.1. Anderson (1942) demonstrated that in 
the sampling distribution for a serial correlation, the absolute value of a critical value at a 
prespecified level of significance is smaller than the corresponding critical value in Table A16. 
Furthermore, the limits that define a critical value at a prespecified level of significance are asym- 
metrical (i.e., the absolute value of a critical value will not be identical for a positive versus a 
negative r value). Anderson (1942) computed the critical two-tailed .05 and .01 values of r for 
values of n between 5 and 75 for lag +1. For large sample sizes he determined that Equation 
28.41 (which employs the normal distribution) can be used to provide a good approximation of 
the critical values of r when the lag value is +1. 


$$ r = \frac{-1}{n - 1} \pm z\,\frac{\sqrt{n - 2}}{n - 1} \qquad \text{(Equation 28.41)} $$

Where: z represents the tabled critical value in the normal distribution that corresponds to the 
prespecified level of significance employed in evaluating r 
n represents the total number of numbers in the series. Note that n is not the number 
of pairs of numbers employed in computing the coefficient of correlation. 
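
A minimal Python sketch of a normal approximation of this kind is given below. It assumes that, for lag +1, the serial correlation coefficient is approximately normally distributed with mean -1/(n - 1) and standard deviation sqrt(n - 2)/(n - 1), which is the large-sample result usually attributed to Anderson (1942). The function name `serial_r_critical_values` and its parameters are illustrative and do not appear in the source.

```python
from math import sqrt
from statistics import NormalDist

def serial_r_critical_values(n, alpha=0.05, two_tailed=True):
    """Approximate lower and upper critical values of a lag +1 serial r.

    Assumption: r is roughly normal with mean -1/(n - 1) and standard
    deviation sqrt(n - 2)/(n - 1), where n is the length of the series
    (not the number of pairs).
    """
    tail = alpha / 2 if two_tailed else alpha
    z = NormalDist().inv_cdf(1 - tail)      # tabled critical z value
    center = -1 / (n - 1)
    half_width = z * sqrt(n - 2) / (n - 1)
    return center - half_width, center + half_width

lower, upper = serial_r_critical_values(n=100)
print(round(lower, 3), round(upper, 3))
```

Because the interval is centered on -1/(n - 1) rather than on zero, the lower limit is larger in absolute value than the upper limit, which reflects the asymmetry noted above.
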


Employing the values derived by Anderson (1942) for the exact sampling distribution of 
the serial correlation coefficient (which is not reproduced in this book), it can be determined (for 
α = .05 and n = 10) that in order to reject the null hypothesis $H_0: \rho = 0$, the computed value of 
r must be equal to or greater than $r_{.05} = .360$ or equal to or less than $r_{.05} = -.564$. (The value 
of n used in Anderson’s (1942) table represents the total number of digits in the series and not 
the number of pairs of digits employed in computing the correlation.) Since r = .28 is less than 
$r_{.05} = .360$, the null hypothesis can be retained. Thus, regardless of whether one employs Table 
A16 or Anderson's (1942) critical values, the null hypothesis is retained. Nevertheless, the 
difference between the critical values in the two tables is substantial. 

Use of Anderson's (1942) tables and/or Equation 28.41 provides for a more powerful test 
of the alternative hypothesis $H_1: \rho \neq 0$ than do the critical values in Table A16. Although the 
degree of discrepancy between a critical value in Table A16 and a critical value computed with 
Equation 28.41 decreases as the size of n increases, even for large sample sizes the absolute 
values in Table A16 are noticeably higher. It should be noted that use of Equation 28.41 with 
small samples yields absolute critical values that are too high. 

It is noted in the discussion of tests of randomness under the single-sample runs test that 
it is not uncommon for two or more of the available tests for determining randomness to yield 
conflicting results. Although autocorrelation is not considered to be among the most rigorous 
tests for randomness, if one conducts multiple autocorrelations on a series (i.e., for the lag values 
+1, +2, +3, etc. and -1, -2, -3, etc.), and all or most lead to retention of the null hypothesis, such 
a protocol will provide a more authoritative analysis with respect to randomness than will the 
single analysis for lag +1 conducted in this section. It should be noted that if for a series of n 
numbers (where the value of n is large) an autocorrelation is conducted for every possible 
positive and negative lag value, just by chance it is expected that some of the computed serial 
correlations will be significant. Whatever prespecified alpha value the researcher employs will 
determine the proportion of significant correlations that can be obtained which will still allow 
one to retain the null hypothesis $H_0: \rho = 0$. 

One limitation of autocorrelation as a test of randomness should be noted. Assume that a 
researcher is evaluating a series in which in any trial a number can assume any one of k = 5 
possible values. For instance, in the example employed in this section it is assumed that the 
integer values 1, 2, 3, 4, 5 are the only possible values that can occur. In a truly random series 
of reasonable length, each of the five digits would be expected to occur approximately the same 
number of times. Yet, it is entirely possible to have a series of numbers in which one or more 
of the integer values do not even occur one time, yet the resulting autocorrelation is r = 0. For 
instance, a computer can be programmed to generate a series of 1000 pseudorandom numbers 
employing the integer values 1, 2, 3, 4, 5. Yet it is theoretically possible for the computer 
algorithm to generate 1000 digits, all of which are either 1 or 2. If the autocorrelation between 
the values of 1 and 2 that are generated is zero, it will suggest the sequence of numbers is 
random. Although it may be a random sequence for a population in which the only values the 
numbers may assume are the integer values 1 or 2, it is not a random series for a population in 
which the numbers may assume an integer value between 1 and 5. Whereas most of the other 
tests that are employed in evaluating randomness will identify this problem, autocorrelation will 
not. 
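
The limitation can be illustrated with a short Python sketch. It simulates a hypothetical defective generator that is supposed to emit the integers 1 through 5 but only ever emits 1 or 2. The seed, the series length, and the function name `lag1_autocorrelation` are assumptions made for illustration; the frequency count at the end is only a crude stand-in for the other tests of randomness mentioned above.

```python
import random
from collections import Counter
from math import sqrt

def lag1_autocorrelation(series):
    """Noncircular lag +1 serial correlation (Pearson r on adjacent pairs)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)             # fixed seed, chosen arbitrarily
possible_values = [1, 2, 3, 4, 5]  # values the series is supposed to contain
series = [rng.choice([1, 2]) for _ in range(1000)]  # defective generator

r = lag1_autocorrelation(series)
counts = Counter(series)
missing = [v for v in possible_values if counts[v] == 0]

# r is expected to be close to zero, yet three of the five values never occur.
print(round(r, 3), missing)
```
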

Autocorrelation and the derivation of the corresponding regression equations (referred to 
as autoregression) are complex subjects that are primarily discussed in books that deal with 
statistical applications in business and economics. Research in such fields as economics, bus- 
iness, and political science often employs autocorrelation for time series analysis, which is a 
methodology for studying the sequential progression of events. The results of a time series 
analysis can be useful in predicting future values for such variables as stock prices, sales 
revenues, crop yields, crime rates, weather, etc. Such predictions are predicated on the fact that 
significant data based on autocorrelation indicate sequential dependence with respect to the 
variable of interest. As is the case with Example 28.1, a regression equation that is derived from 
data that are autocorrelated is employed in making predictions. Use of a regression equation in 
this context is referred to as autoregression. Sources that discuss autoregression note that 
derivation of a regression equation through use of the method of least squares as described in 
Section VI will underestimate error variability and, consequently, will not provide the most accurate 
basis for prediction. For this reason when autocorrelation is employed with a set of data, alternative 
procedures are recommended for making predictions, as well as for evaluating the null hypothesis 
$H_0: \rho = 0$. A procedure recommended in most sources is one developed by Durbin and 
Watson (1950, 1951, 1971). Among the sources that describe the Durbin-Watson test (which 
is only appropriate for a lag value of +1) are Chou (1989), Montgomery and Peck (1992), and 
Neter et al. (1983). The latter sources also describe other alternative approaches for autoregression. 
Other books that discuss autocorrelation are Schmidt and Taylor (1970) and Banks 
and Carson (1984). 


VIII. Additional Examples Illustrating the Use of the Pearson 
Product-Moment Correlation Coefficient 


Two additional examples that can be evaluated with the Pearson product-moment correlation 
coefficient are presented in this section. Since the data for Examples 28.2 and 28.3 are identical 
to the data employed in Example 28.1, they yield the same result. 


Example 28.2 The editor of an automotive magazine conducts a survey to see whether it is 
possible to predict the number of traffic citations one receives for speeding based on how often 
a person changes his or her motor oil. The responses of five subjects on the two variables follow. 
(For each subject, the first score represents the number of oil changes (which represents the X 
variable), and the second score the number of traffic citations (which represents the Y variable).) 
Subject 1 (20, 7); Subject 2 (0, 0); Subject 3 (1, 2); Subject 4 (12, 5); Subject 5 (3, 3). Do the 
data indicate there is a significant correlation between the two variables? 


Example 28.3 A pediatrician speculates that the length of time an infant is breast fed may be 
related to how often a child becomes ill. In order to answer the question, the pediatrician 
obtains the following two scores for five three-year-old children: The number of months the 
child was breast fed (which represents the X variable) and the number of times the child was 
brought to the pediatrician's office during the current year (which represents the Y variable). 
The scores for the five children follow: Child 1 (20, 7); Child 2 (0, 0); Child 3 (1, 2); Child 4 
(12, 5); Child 5 (3, 3). Do the data indicate that the length of time a child is breast fed is related 
to the number of times a child is brought to the pediatrician? 
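
Since Examples 28.2 and 28.3 share one set of five (X, Y) pairs, a single computation covers both. The Python sketch below computes r from the raw-score sums and then the t statistic with df = n - 2 that is used to evaluate the null hypothesis that the population correlation is zero. The variable names are illustrative only.

```python
from math import sqrt

# (X, Y) pairs shared by Examples 28.2 and 28.3
pairs = [(20, 7), (0, 0), (1, 2), (12, 5), (3, 3)]
x = [p[0] for p in pairs]
y = [p[1] for p in pairs]
n = len(pairs)

# Sums of squares and the sum of products, in raw-score form
ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
sp_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

r = sp_xy / sqrt(ss_x * ss_y)

# t statistic for the null hypothesis of zero correlation, df = n - 2
df = n - 2
t = r * sqrt(df) / sqrt(1 - r ** 2)

print(round(r, 3), round(t, 2), df)
```
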


IX. Addendum 


The Addendum will discuss four additional topics that are directly or indirectly related to the 
general subject of correlational analysis. 

1) The first part of the Addendum describes three bivariate correlational measures that are 
related to the Pearson product-moment correlation coefficient. The three procedures that will 
be described are a) The point-biserial correlation coefficient (Test 28h); b) The biserial 
correlation coefficient (Test 28i); and c) The tetrachoric correlation coefficient (Test 28j). 

2) The second part of the Addendum describes the following multivariate correlational 
measures that are employed within the framework of multiple regression analysis: a) The 
multiple correlation coefficient (Test 28k); b) The partial correlation coefficient (Test 28l); 
and c) The semi-partial correlation coefficient (Test 28m). The use of the term multivariate 
within the framework of the procedures to be described in this section implies that data for three 
or more variables are employed in the analysis. 

3) The third part of the Addendum provides a general overview of the following multi- 
variate statistical procedures, which directly or indirectly involve some form of correlational 
analysis: a) Factor analysis; b) Canonical correlation; and c) Discriminant analysis and 
logistic regression. The presentation of the material in this section will be nonmathematical, and 
except for factor analysis (which will be described in greater detail), the description of each 
procedure will be brief. 

4) The fourth part of the Addendum discusses meta-analysis and related topics. Meta-analysis 
is a methodology for pooling the results of multiple studies which evaluate the same 
general hypothesis. A major component of meta-analysis involves evaluating measures of effect 
size, which are correlational measures. Within the framework of the discussion of meta-analysis, 
criticisms that have been directed at the conventional hypothesis testing model will be 
considered. Specifically, the conventional hypothesis testing model employs the concept of 
statistical significance (as opposed to employing measures of effect size) as the criterion for 
defining the relationship between two or more variables in an experiment. Critics of the latter 
model argue that measures of effect size are more meaningful indicators than statistical 
significance of the nature and strength of the relationship between experimental variables. 


1. Bivariate measures of correlation that are related to the Pearson product-moment 
correlation coefficient  This section of the Addendum will describe three bivariate 
correlational measures that are related to the Pearson product-moment correlation coefficient. 
Each of the correlation coefficients to be described assumes that the scores on at least one of the 
variables can be expressed within the format of interval/ratio data, and that the underlying 
distribution of these scores is continuous and normal. Two of the correlational procedures 
assume that the underlying interval/ratio scores on one or both of the variables have been 
converted into a dichotomous (two category) format. A brief description of the three procedures 
follows: 

The point-biserial correlation coefficient (Test 28h)  The point-biserial correlation 
coefficient ($r_{pb}$) (which is a special case of the Pearson product-moment correlation coefficient) 
is employed if one variable is expressed as interval/ratio data, and the other variable is 
represented by a dichotomous nominal/categorical scale (i.e., two categories). 

The biserial correlation coefficient (Test 28i)  The biserial correlation coefficient 
($r_b$) is employed if both variables are based on an interval/ratio scale, but the scores on one of 
the variables have been transformed into a dichotomous nominal/categorical scale. It provides 
an estimate of the value that would be obtained for the Pearson product-moment correlation 
coefficient if, instead of the dichotomized variable, one employed the scores on the underlying 
interval/ratio scale which the latter variable represents. 

The tetrachoric correlation coefficient (Test 28j)  The tetrachoric correlation coefficient 
($r_{tet}$) is employed if both variables are based on an interval/ratio scale, but the scores on 
both of the variables have been transformed into a dichotomous nominal/categorical scale. It 
provides an estimate of the value that would be obtained for the Pearson product-moment 
correlation coefficient, if, instead of the dichotomized variables, one employed the scores on 
the underlying interval/ratio scales that the latter variables represent. 


Test 28h: The point-biserial correlation coefficient ($r_{pb}$)  As noted earlier, the point-biserial 
correlation coefficient represents a special case of the Pearson product-moment correlation 
coefficient. The point-biserial correlation coefficient is employed if one variable is expressed 
as interval/ratio data, and the other variable is represented by a dichotomous nominal/categorical 
scale. Examples of variables that constitute a dichotomous nominal/categorical scale are male 
versus female and employed versus unemployed. In using the point-biserial correlation coef- 
ficient, it is assumed that the dichotomous variable is not based on an underlying continuous 
interval/ratio distribution. If, in fact, the dichotomous variable is based on the latter type of dis- 
tribution, the biserial correlation coefficient (Test 28i) is the appropriate measure to employ. 
Examples of variables that are expressed in a dichotomous format, but which are based on an 
underlying continuous interval/ratio distribution are pass versus fail and above average intel- 
ligence versus below average intelligence. Obviously, not everyone who passes (or fails) a test 
or a course performs at the same level. In the same respect, the distribution of intelligence of 
people who are above average or below average is not uniform. There are, of course, variables 
with respect to which it can be argued whether they are based on an underlying continuous 
distribution (such as perhaps handedness, which will be employed as a dichotomous variable in 
the example to be presented in this section). As is the case with the Pearson product-moment 
correlation coefficient, the range of values within which $r_{pb}$ can fall is $-1 \le r_{pb} \le +1$. 
Example 28.4 will be employed to illustrate the use of the point-biserial correlation coefficient. 


Example 28.4 A study is conducted to determine whether there is a correlation between 
handedness and eye-hand coordination. Five right-handed and five left-handed subjects are 
administered a test of eye-hand coordination. The test scores of the subjects follow (the higher 
a subject's score, the better his or her eye-hand coordination): Right-handers: 11, 1, 0, 2, 0; 
Left-handers: 11, 11, 5, 8, 4. Is there a statistical relationship between handedness and eye- 
hand coordination? 


In the analysis handedness will represent the X variable, and the eye-hand coordination test 
scores will represent the Y variable. With respect to handedness (which is a dichotomous vari- 
able), all right-handed subjects will be assigned a score of 1 on the X variable, and all left-handed 
subjects will be assigned a score of 0. Table 28.6 summarizes the data for the ten subjects 
employed in the study. 


Table 28.6 Data for Example 28.4 


Subject     X      X²      Y      Y²      XY
   1        1       1     11     121      11
   2        1       1      1       1       1
   3        1       1      0       0       0
   4        1       1      2       4       2
   5        1       1      0       0       0
   6        0       0     11     121       0
   7        0       0     11     121       0
   8        0       0      5      25       0
   9        0       0      8      64       0
  10        0       0      4      16       0

ΣX = 5   ΣX² = 5   ΣY = 53   ΣY² = 473   ΣXY = 14


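Since the point-biserial correlation coefficient is a special case of the Pearson product- 
moment correlation coefficient, Equation 28.42 (which is identical to Equation 28.1) is 
employed to compute $r_{pb}$. Employing Equation 28.42, the value $r_{pb} = -.57$ is computed. 

$$ r_{pb} = \frac{\Sigma XY - \frac{(\Sigma X)(\Sigma Y)}{n}}{\sqrt{\left[\Sigma X^2 - \frac{(\Sigma X)^2}{n}\right]\left[\Sigma Y^2 - \frac{(\Sigma Y)^2}{n}\right]}} = \frac{14 - \frac{(5)(53)}{10}}{\sqrt{\left[5 - \frac{(5)^2}{10}\right]\left[473 - \frac{(53)^2}{10}\right]}} = -.57 \qquad \text{(Equation 28.42)} $$
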
Equation 28.43 is an alternative equation for computing the point-biserial correlation 
coefficient. 

$$ r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{\tilde{s}_Y}\sqrt{\frac{n p_0 p_1}{n - 1}} = \frac{2.8 - 7.8}{4.62}\sqrt{\frac{(10)(.5)(.5)}{10 - 1}} = -.57 \qquad \text{(Equation 28.43)} $$

Where: $\bar{Y}_0$ and $\bar{Y}_1$ are, respectively, the average scores on the Y variable for subjects who are 
categorized 0 versus 1 on the X variable 
$p_0$ equals the proportion of subjects with an X score of 0 
$p_1$ equals the proportion of subjects with an X score of 1 


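In employing Equation 28.43, $\bar{Y}_1 = 2.8$ and $\bar{Y}_0 = 7.8$. The value $\tilde{s}_Y = 4.62$, which 
represents the unbiased estimate of the population standard deviation for the Y variable (which 
is computed with Equation I.8), is computed below. 

$$ \tilde{s}_Y = \sqrt{\frac{\Sigma Y^2 - \frac{(\Sigma Y)^2}{n}}{n - 1}} = \sqrt{\frac{473 - \frac{(53)^2}{10}}{10 - 1}} = 4.62 $$
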
It should be noted that some sources employ Equation 28.44 to compute the value of the 
point-biserial correlation coefficient. 

$$ r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{s_Y}\sqrt{p_0 p_1} = \frac{2.8 - 7.8}{4.38}\sqrt{(.5)(.5)} = -.57 \qquad \text{(Equation 28.44)} $$

Note that Equation 28.44 employs the sample standard deviation (computed with Equation 
I.7; i.e., $s_Y = \sqrt{[\Sigma Y^2 - ((\Sigma Y)^2/n)]/n}$), which is a biased estimate of the population standard 
deviation. For Example 28.4, $s_Y = 4.38$. When $s_Y = 4.38$ is substituted in Equation 28.44, 
it yields the value $r_{pb} = -.57$. 
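
The agreement among the three routes to the point-biserial coefficient can be checked with a short Python sketch. It uses the Example 28.4 scores and follows the forms described above: the raw-score Pearson formula, the version based on the unbiased standard deviation, and the version based on the biased standard deviation. The variable names are illustrative and are not taken from the source.

```python
from math import sqrt

right_handers = [11, 1, 0, 2, 0]   # coded X = 1
left_handers = [11, 11, 5, 8, 4]   # coded X = 0

x = [1] * len(right_handers) + [0] * len(left_handers)
y = right_handers + left_handers
n = len(y)

# Route 1: Pearson raw-score formula applied to the 0/1 coding
sp_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
ss_x = sum(a * a for a in x) - sum(x) ** 2 / n
ss_y = sum(b * b for b in y) - sum(y) ** 2 / n
r_pearson = sp_xy / sqrt(ss_x * ss_y)

# Ingredients shared by the two alternative equations
mean_1 = sum(right_handers) / len(right_handers)
mean_0 = sum(left_handers) / len(left_handers)
p1 = len(right_handers) / n
p0 = len(left_handers) / n
s_unbiased = sqrt(ss_y / (n - 1))  # divides by n - 1
s_biased = sqrt(ss_y / n)          # divides by n

# Route 2: unbiased standard deviation with the n/(n - 1) correction
r_unbiased_form = (mean_1 - mean_0) / s_unbiased * sqrt(n * p0 * p1 / (n - 1))

# Route 3: biased standard deviation, no correction needed
r_biased_form = (mean_1 - mean_0) / s_biased * sqrt(p0 * p1)

# All three values round to -.57
print(round(r_pearson, 2), round(r_unbiased_form, 2), round(r_biased_form, 2))
```
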

The reader should take note of the fact that the sign of $r_{pb}$ is irrelevant unless the categories 
on the dichotomized variable are ordered (which is not the case for Example 28.4). The reason 
for employing the absolute value of $r_{pb}$ is that the use of the scores 0 and 1 for the two categories 
is arbitrary, and does not indicate that one category is superior to the other. (If all right-handed 
subjects are assigned a score of 0 on the X variable and all left-handed subjects are assigned a 
score of 1, the value computed is $r_{pb} = +.57$, which has the same absolute value as that computed for 
the data in Table 28.6.) Since the categories are not ordered, from this point on in the discussion, 
the absolute value $r_{pb} = .57$ will be employed to represent the value of $r_{pb}$. In the event the 
categories are ordered, the score 1 should be employed for the category associated with higher 
performance/quality, and the score 0 should be employed for the category associated with lower 
performance/quality. In all likelihood, if the categories are ordered they are likely to be based 
on an underlying continuous distribution, and in such a case the appropriate correlational 
measure to employ is the biserial correlation coefficient. 

The square of the point-biserial correlation coefficient represents the coefficient of 
determination, which as noted in Section VI indicates the amount of variability on the Y variable 
that can be accounted for by variability on the X variable. Since $r_{pb}^2 = (.57)^2 = .325$, 32.5% 
of the variability on the test of eye-hand coordination can be accounted for on the basis of a 
person’s handedness. 

The data employed for Example 28.4 are identical to those employed for Example 11.1 
(which is used to illustrate the t test for two independent samples). In point of fact, the point- 
biserial correlation coefficient can be employed to measure the magnitude of treatment 
effect in an experiment that has been evaluated with a t test for two independent samples, 
if the grouping of the subjects is conceptualized as the dichotomous variable. Thus, in 
Example 11.1, if each subject in Group 1 (Drug Group) is assigned an X score of 1, and each 
subject in Group 2 (Placebo Group) is assigned an X score of 0, the data for the experiment 
can be summarized with Table 28.6. Since analysis of the data in Table 28.6 yields $r_{pb} = .57$ 
and $r_{pb}^2 = .325$, the researcher can conclude that 32.5% of the variability on the dependent 
variable (the depression ratings for subjects) can be accounted for on the basis of which group 
a subject is a member. 

In Section VI of the t test for two independent samples, the measure of association that 
is employed to measure the magnitude of treatment effect is the omega squared ($\tilde{\omega}^2$) statistic. 
The value of omega squared computed for Example 11.1 is $\tilde{\omega}^2 = .22$. Since $\tilde{\omega}^2$ is interpreted 
in the same manner as $r_{pb}^2$, the value $\tilde{\omega}^2 = .22$ indicates that 22% of the variability on the 
dependent variable can be accounted for on the basis of which group a subject is a member. 
Obviously, the latter value is lower than the value $r_{pb}^2 = .325$ computed in this section. The 
discrepancy between the two values will be discussed further later in this section. In point of 
fact, $r_{pb}^2 = .325$ is equivalent to the eta squared ($\tilde{\eta}^2$) statistic, which is an alternative measure 
of association that some sources employ in assessing the magnitude of treatment effect for the 
t test for two independent samples. Marascuilo and Serlin (1988) note that both $r_{pb}^2$ and $\tilde{\eta}^2$ 
represent a correlation ratio. The correlation ratio, which can be defined within the 
framework of an analysis of variance, is the ratio of the explained sum of squares over the total 
sum of squares. To clarify the meaning of a correlation ratio, let us assume that in lieu of the t 
test for two independent samples, the data for Example 11.1 are evaluated with the single- 
factor between-subjects analysis of variance (Test 21). Table 28.7 is the summary table of the 
analysis of variance for Example 11.1. 


Table 28.7 Summary Table of Analysis of Variance for Example 11.1 


Source of variation SS df MS F 
Between-groups 62.5 1 62.5 3.86 
Within-groups 129.6 8 16.2 

Total 192.1 9 


Within the framework of the single-factor between-subjects analysis of variance, the 
correlation ratio (which is computed with Equation 21.43) is $\tilde{\eta}^2 = SS_{BG}/SS_T$. Since both 
$\tilde{\eta}^2$ and $r_{pb}^2$ represent the correlation ratio, $\tilde{\eta}^2 = r_{pb}^2 = SS_{BG}/SS_T$. Thus, for Example 11.1, 
$\tilde{\eta}^2 = r_{pb}^2 = 62.5/192.1 = .325$. 

Equation 28.45 can also be employed to compute the value $\tilde{\eta}^2 = r_{pb}^2$ (where $t = \sqrt{F} 
= \sqrt{3.86} = 1.964$). 

$$ \tilde{\eta}^2 = r_{pb}^2 = \frac{t^2}{t^2 + df} = \frac{(1.964)^2}{(1.964)^2 + 8} = .325 \qquad \text{(Equation 28.45)} $$

Where: $df = n_1 + n_2 - 2$ (which is the degrees of freedom for the t test for two inde- 
pendent samples) 


The fact that different measures of magnitude of treatment effect may not yield the same 
value is discussed in Section VI of the single-factor between-subjects analysis of variance. 
In the latter discussion it is noted that the computed value $\tilde{\eta}^2$ is a biased estimate of the 
underlying population parameter $\eta^2$, and an adjusted value (which is less biased) can be 
computed with Equation 21.44. The latter value is now computed for Example 11.1: 
Adjusted $\tilde{\eta}^2 = 1 - (MS_{WG}/MS_T) = 1 - (16.2/21.34) = .24$. (Where $MS_T = SS_T/df_T = 192.1/9 
= 21.34$.) Note that the value Adjusted $\tilde{\eta}^2 = .24$ is closer to the value $\tilde{\omega}^2 = .22$ than the 
previously computed value $\tilde{\eta}^2 = r_{pb}^2 = .325$. 
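
The correlation ratio and its adjusted value can be reproduced from the raw scores with the following Python sketch. It partitions the total sum of squares for the two groups of Example 11.1, then computes eta squared three ways: as the ratio of the between-groups to the total sum of squares, from t² and df, and in adjusted form. The dictionary keys and variable names are illustrative only.

```python
groups = {
    "group_1": [11, 1, 0, 2, 0],    # scores listed for the first group
    "group_2": [11, 11, 5, 8, 4],   # scores listed for the second group
}

scores = [v for g in groups.values() for v in g]
n = len(scores)
grand_mean = sum(scores) / n

# Partition of the total sum of squares
ss_total = sum((v - grand_mean) ** 2 for v in scores)
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups.values())
ss_within = ss_total - ss_between

df_between = len(groups) - 1
df_within = n - len(groups)
ms_within = ss_within / df_within
ms_total = ss_total / (n - 1)
f_ratio = (ss_between / df_between) / ms_within

eta_squared = ss_between / ss_total
# With two groups t**2 equals F, so t**2 / (t**2 + df) gives the same value.
eta_squared_from_t = f_ratio / (f_ratio + df_within)
adjusted_eta_squared = 1 - ms_within / ms_total

print(round(eta_squared, 3), round(eta_squared_from_t, 3),
      round(adjusted_eta_squared, 2))
```
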


Test 28h-a: Test of significance for a point-biserial correlation coefficient  The null 
hypothesis $H_0: \rho_{pb} = 0$ can be evaluated with Equation 28.46 (which is identical to Equation 
28.3, which is employed to evaluate the same hypothesis with reference to the Pearson product- 
moment correlation coefficient). As is the case for Equation 28.3, the degrees of freedom for 
Equation 28.46 are df = n - 2. Employing Equation 28.46, the value t = 1.96 is computed. 


$$ t = \frac{r_{pb}\sqrt{n - 2}}{\sqrt{1 - r_{pb}^2}} = \frac{(.57)\sqrt{10 - 2}}{\sqrt{1 - (.57)^2}} = 1.96 \qquad \text{(Equation 28.46)} $$


It will be assumed that the nondirectional alternative hypothesis $H_1: \rho_{pb} \neq 0$ is evaluated. 
Employing Table A2, for df = 10 - 2 = 8, the tabled critical two-tailed .05 and .01 values are 
$t_{.05} = 2.31$ and $t_{.01} = 3.36$. Since the obtained value t = 1.96 is less than both of the 
aforementioned critical values, the null hypothesis $H_0: \rho_{pb} = 0$ cannot be rejected. 

Since the value of the point-biserial correlation coefficient is a direct function of the 
difference between $\bar{Y}_0$ and $\bar{Y}_1$, a significant difference between the latter two mean values 
indicates that the absolute value of the correlation between the two variables is significantly 
above zero. Thus, an alternative way of evaluating the null hypothesis $H_0: \rho_{pb} = 0$ is to conduct 
a t test for two independent samples, contrasting the two mean values $\bar{Y}_0$ and $\bar{Y}_1$. The fact that 
the latter analysis will yield a result that is equivalent to that obtained with Equation 28.46 can 
be confirmed by the fact that the value t = 1.96 computed above with Equation 28.46 is identical 
to the absolute t value computed for the same set of data with Equation 11.1 (for Example 11.1). 
Sources that provide additional discussion of the point-biserial correlation coefficient are 
Guilford (1965), Lindeman et al. (1980), and McNemar (1969). 
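
The equivalence of the two approaches can be verified numerically. The Python sketch below computes t once from the point-biserial coefficient and once from the two group means. The pooled-variance form of the t test for two independent samples is assumed for the second computation, since Equation 11.1 is not reproduced in this section; the variable names are illustrative.

```python
from math import sqrt

group_1 = [11, 1, 0, 2, 0]    # subjects coded X = 1
group_0 = [11, 11, 5, 8, 4]   # subjects coded X = 0
n1, n0 = len(group_1), len(group_0)
n = n1 + n0
mean_1 = sum(group_1) / n1
mean_0 = sum(group_0) / n0

# Route 1: t test on the two means (pooled variance assumed)
ss_1 = sum((v - mean_1) ** 2 for v in group_1)
ss_0 = sum((v - mean_0) ** 2 for v in group_0)
pooled_variance = (ss_1 + ss_0) / (n - 2)
t_from_means = (mean_1 - mean_0) / sqrt(pooled_variance * (1 / n1 + 1 / n0))

# Route 2: t computed from the point-biserial coefficient, df = n - 2
x = [1] * n1 + [0] * n0
y = group_1 + group_0
sp_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
ss_x = sum(a * a for a in x) - sum(x) ** 2 / n
ss_y = sum(b * b for b in y) - sum(y) ** 2 / n
r_pb = sp_xy / sqrt(ss_x * ss_y)
t_from_r = r_pb * sqrt(n - 2) / sqrt(1 - r_pb ** 2)

critical_t_05 = 2.31   # tabled two-tailed .05 value for df = 8, from the text
print(round(t_from_means, 2), round(t_from_r, 2),
      abs(t_from_r) >= critical_t_05)
```

Both routes give the same t value (negative here because of the 0/1 coding), and its absolute value falls short of the tabled critical value.
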


Test 28i: The biserial correlation coefficient ($r_b$)  The biserial correlation coefficient is 
employed if both variables are based on an interval/ratio scale, but the scores on one of the 
variables have been transformed into a dichotomous nominal/categorical scale. An example of 
a situation where an interval/ratio variable would be expressed as a dichotomous variable is a test 
based on a normally distributed interval/ratio scale for which the only information available is 
whether a subject has passed or failed the test. The value computed for the biserial correlation 
coefficient represents an estimate of the value that would be obtained for the Pearson product- 
moment correlation coefficient, if, instead of employing a dichotomized variable, one had 
employed the scores on the underlying interval/ratio scale. 

The biserial correlation coefficient is based on the assumption that the underlying dis- 
tribution for both of the variables is continuous and normal. Since the accuracy of $r_b$ is highly 
dependent upon the assumption of normality, it should not be employed unless there is empirical 
evidence to indicate that the distribution underlying the dichotomous variable is normal. If the 
underlying distribution of the dichotomous variable deviates substantially from normality, the 
computed value of $r_b$ will not be an accurate approximation of the underlying population 
correlation that $r_b$ estimates. One consequence of the normality assumption being violated is that, under 
certain conditions, the absolute value computed for $r_b$ may exceed 1. In point of fact, Lindeman 
et al. (1980) note that the theoretical limits of $r_b$ are $-\infty < r_b < +\infty$. 

In contrast to $r_{pb}$, the sign of the biserial correlation coefficient should be taken into 
account, since it clarifies the nature of the relationship between the two variables. This is the 
case, since the dichotomous variable will involve two ordered categories. In assigning scores to 
subjects on the ordered dichotomized variable, the score 1 should be employed for the category 
associated with higher performance/quality, and the score 0 should be employed for the category 
associated with lower performance/quality. 

In order to illustrate the computation of the biserial correlation coefficient, let us assume 
a researcher wants to determine whether there is a statistical relationship between intelligence 
and eye-hand coordination. Ten subjects are categorized with respect to both variables. 
Although the evaluation of each subject’s intelligence is based on an interval/ratio intelligence 
test score, we will assume that the only information available to the researcher is whether an 
individual is above or below average in intelligence. In view of this, intelligence, which will be 
designated the X variable, will have to be represented as a dichotomous variable. Subjects who 
are above average in intelligence will be assigned a score of 1, and subjects who are below 
average in intelligence will be assigned a score of 0. The scores on the eye-hand coordination 
test will represent the Y variable. Example 28.5, which employs the same set of data as Example 
28.4, summarizes the above described experiment. 


Example 28.5 A study is conducted to determine whether there is a correlation between 
intelligence and eye-hand coordination. Five subjects who are above average in intelligence and 
five subjects who are below average in intelligence are administered a test of eye-hand coor- 
dination. The test scores of the subjects follow (the higher a subject’s score, the better his or her 
eye-hand coordination): Above average intelligence: 11, 1, 0, 2, 0; Below average intel- 
ligence: 11, 11, 5, 8, 4. Is there a statistical relationship between intelligence and eye-hand 
coordination? 


The biserial correlation coefficient can be computed with either Equation 28.47 or 28.48. 
It can also be computed with Equation 28.49 if r_pb has been computed for the same set of data.
Note that except for h, all of the terms in the aforementioned equations are also employed
in computing the point-biserial correlation coefficient. The value h represents the height
(known more formally as the ordinate) of the standard normal distribution at the point which
divides the proportions p_0 and p_1. Specifically, employing Table A1, the z value is identified
that delineates the point on the normal curve that a proportion of the cases corresponding to the
smaller of the two proportions p_0 versus p_1 falls above and the larger of the two proportions falls
below. The tabled value of h (in Column 4 of Table A1) associated with that z value is
employed in whatever equation one employs for computing r_b. If, as is the case in our example,
p_0 = p_1 = .5, the value of z will equal zero, and thus the corresponding value of h = .3989.
When h = .3989 and the other appropriate values employed for Example 28.5 (which are
summarized in Table 28.6) are substituted in Equations 28.47-28.49, the value r_b = -.71 is
computed. 


© 2000 by Chapman & Hall/CRC


(Equation 28.47)

\[ r_b = \frac{r_{pb}\sqrt{p_0 p_1}}{h} = \frac{(-.57)\sqrt{(.5)(.5)}}{.3989} = -.71 \]  (Equation 28.49)
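
The computation for Example 28.5 can be checked with a short Python sketch. It is an illustration rather than part of the original text: the function names are hypothetical, r_pb is obtained as an ordinary Pearson correlation between the 0/1 intelligence codes and the coordination scores, and r_b is then derived through the relationship in Equation 28.49. The ordinate h is taken from Python's statistics.NormalDist (Python 3.8 or later) instead of Table A1.

```python
from math import sqrt
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def biserial_from_point_biserial(r_pb, p1):
    """Equation 28.49: r_b = r_pb * sqrt(p0 * p1) / h."""
    p0 = 1.0 - p1
    std_normal = NormalDist()
    # z value that separates the larger proportion (below) from the smaller (above).
    z = std_normal.inv_cdf(max(p0, p1))
    h = std_normal.pdf(z)  # ordinate of the standard normal curve at z
    return r_pb * sqrt(p0 * p1) / h


# Example 28.5: X = 1 (above average intelligence), X = 0 (below average).
x = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y = [11, 1, 0, 2, 0, 11, 11, 5, 8, 4]

r_pb = pearson_r(x, y)
r_b = biserial_from_point_biserial(r_pb, p1=sum(x) / len(x))
print(f"r_pb = {r_pb:.2f}, r_b = {r_b:.2f}")  # r_pb = -0.57, r_b = -0.71
```

Because the sketch works from the raw scores, it also shows that the sign of r_b follows directly from coding the higher category as 1.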


Note that for the same set of data, the absolute value r_b = .71 is larger than the absolute
value r_pb = .57 computed for the point-biserial correlation coefficient. In point of fact, for
the same set of data (except when r_b = r_pb = 0) the absolute value of r_b will always be larger
than the absolute value of r_pb, since √(p_0 p_1)/h will always be larger than 1. The closer together
the values p_0 and p_1, the less the discrepancy between the values of r_b and r_pb. If there is
reason to believe that the normality assumption for the dichotomous variable has been violated,
most sources recommend computing r_pb instead of r_b, since r_b may be a spuriously inflated
estimate of the underlying population correlation. When the latter is taken into consideration, 
along with the fact that by dichotomizing a continuous variable one sacrifices valuable 
information, it can be understood why the biserial correlation coefficient is infrequently 
employed within the framework of research. 

Guilford (1965) notes that, given the normality assumption has not been violated, those 
conditions which optimize the likelihood of r_b providing a good estimate of the underlying
population parameter ρ_b are as follows: a) The value of n is large; and b) The values of p_0 and
p_1 are close together. It should be noted that (as is the case for Pearson r and r_pb) if the
relationship between two variables is nonlinear, the computed value of r_b will only represent the
degree of linear relationship between the variables. 


Test 28i-a: Test of significance for a biserial correlation coefficient Lindeman et al. (1980) 
note that the null hypothesis H_0: ρ_b = 0 can be evaluated with Equation 28.50. Although the
latter equation, which is based on the normal distribution, assumes a large sample size, it is
employed to evaluate the value r_b = -.71 computed for Example 28.5. Note that the sign of z
will always be the same as the sign of r_b.


\[ z = \frac{r_b}{\frac{1}{h}\sqrt{\frac{p_0 p_1}{n}}} = \frac{-.71}{\frac{1}{.3989}\sqrt{\frac{(.5)(.5)}{10}}} = -1.79 \]  (Equation 28.50)


Employing Table A1, the tabled critical two-tailed values are z_.05 = 1.96 and z_.01 = 2.58,
and the tabled critical one-tailed values are z_.05 = 1.65 and z_.01 = 2.33. The nondirectional
alternative hypothesis H_1: ρ_b ≠ 0 is not supported, since the absolute value z = 1.79 is less than
the tabled critical two-tailed value z_.05 = 1.96. However, the directional alternative hypothesis
H_1: ρ_b < 0 is supported at the .05 level, since z = -1.79 is a negative number with an absolute
value that is greater than the tabled critical one-tailed value z_.05 = 1.65. The moderately strong
negative correlation between the two variables indicates that subjects who score below average
on the intelligence test perform better on the test of eye-hand coordination than do subjects who
score above average on the intelligence test. The latter can be confirmed by the fact that
Ȳ_1 = 2.8 is less than Ȳ_0 = 7.8.
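
The significance test can be sketched in Python as follows. This is a minimal illustration that assumes the standard error in Equation 28.50 has the form (1/h)√(p_0 p_1/n), which reproduces the value z = -1.79 reported in the text. The function name is hypothetical, and the tail probabilities come from statistics.NormalDist rather than from Table A1.

```python
from math import sqrt
from statistics import NormalDist


def biserial_z(r_b, p1, n):
    """z statistic for H0: rho_b = 0, assuming SE = (1/h) * sqrt(p0 * p1 / n)."""
    p0 = 1.0 - p1
    std_normal = NormalDist()
    h = std_normal.pdf(std_normal.inv_cdf(max(p0, p1)))  # ordinate at the dividing z
    standard_error = sqrt(p0 * p1 / n) / h
    return r_b / standard_error


z = biserial_z(r_b=-0.71, p1=0.5, n=10)
p_one_tailed = NormalDist().cdf(-abs(z))   # area beyond |z| in one tail
p_two_tailed = 2 * p_one_tailed

print(f"z = {z:.2f}")                       # z = -1.79
print(p_two_tailed < 0.05, p_one_tailed < 0.05)
```

The two comparisons mirror the decision in the text: the two-tailed probability exceeds .05, whereas the one-tailed probability does not.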

As is the case for the point-biserial correlation coefficient, since the value of the biserial 
correlation coefficient is a direct function of the difference between Ȳ_1 and Ȳ_0, a significant
difference between the two mean values indicates that the absolute value of the correlation
between the two variables is significantly above zero. Thus, an alternative way of evaluating
the null hypothesis H_0: ρ_b = 0 is to contrast the means Ȳ_1 and Ȳ_0 with a t test for two
independent samples. However, the result obtained with Equation 28.50 will not necessarily be
consistent with the result obtained if the t test for two independent samples is employed to
contrast Ȳ_1 versus Ȳ_0 (especially if the sample size is small). In point of fact, use of the t test
for two independent samples to contrast Ȳ_1 versus Ȳ_0 assumes the use of r_pb as a measure of
association. Within the context of employing the t test, the correlational example under
discussion can be conceptualized as a study in which intelligence represents the independent
variable and eye-hand coordination the dependent variable. The independent variable, which is 
nonmanipulated, is comprised of the two levels, above average intelligence versus below 
average intelligence. Sources that provide additional discussion of the biserial correlation 
coefficient are Guilford (1965), Lindeman et al. (1980), and McNemar (1969). 


Test 28j: The tetrachoric correlation coefficient (r_tet) The tetrachoric correlation coefficient
is employed if both variables are based on an interval/ratio scale, but the scores on both
of the variables have been transformed into a dichotomous nominal/categorical scale. The value 
computed for the tetrachoric correlation coefficient represents an estimate of the value one 
would obtain for the Pearson product-moment correlation coefficient if, instead of employing 
dichotomized variables, one had used the scores on the underlying interval/ratio scales. The 
tetrachoric correlation coefficient (which was developed by Karl Pearson (1901)) is based on 
the assumption that the underlying distribution for both of the variables is continuous and normal. 
Among others, Cohen and Cohen (1983) note that caution should be employed in using both the 
tetrachoric and biserial correlation coefficients, since both measures are based on hypothetical 
underlying distributions that are not directly observed. Since the accuracy of r_tet is highly dependent
upon the assumption of normality, it should not be employed unless there is empirical
evidence to indicate that the distributions underlying the dichotomous variables are normal. 

Since the magnitude of the standard error of estimate of r_tet is large relative to the standard
error of estimate of Pearson r, in order to provide a reasonable estimate of r the sample size
employed for computing r_tet should be quite large. Guilford (1965) and Lindeman et al. (1980)
state that the value of n employed in computing r_tet should be at least two times that which would
be employed to compute r. As is the case for the Pearson product-moment correlation coefficient,
the following apply to the tetrachoric correlation coefficient: a) The range of values
within which r_tet can fall is -1 ≤ r_tet ≤ +1; and b) If the relationship between two variables is
nonlinear, the computed value of r_tet will only represent the degree of linear relationship between
the variables. 

Earlier in this section (as well as in the discussion of the chi-square test for r x c tables) 
it is noted that the phi coefficient (φ) is also employed as a measure of association for a 2 x 2
contingency table involving two dichotomous variables. The basic difference between r_tet and
φ is that the latter measure is employed with two genuinely dichotomous variables (i.e., variables
that are not based on an underlying distribution involving an interval/ratio scale). Cohen and
Cohen (1983) and McNemar (1969) note that the value of r_tet computed for a 2 x 2 contingency
table will always be larger than the value of φ computed for the same data.

A number of reasons account for the fact that the tetrachoric correlation coefficient is
infrequently employed within the framework of research. One reason is that, in most instances,
data on variables that represent an interval/ratio scale are available in the latter format, and thus 
there is no need to convert it into a dichotomous format. Another reason is the reluctance of 
researchers to accept the normality assumption with respect to variables for which only dichoto- 
mous information is available. 

Without the aid of a computer or special tables (which can be found in Guilford (1965) and 
Lindeman et al. (1980)), the computation of the exact value of r_tet is both time consuming and
tedious. There are, however, two equations that have been developed which provide reasonably
good approximations of r_tet under most conditions. These equations will be employed to
evaluate Example 28.6. 


Example 28.6 Two hundred subjects are asked whether they Agree (which will be assigned a 
score of 1) or Disagree (which will be assigned a score of 0) with the following two statements: 
Question 1: I believe that abortion should be legal. Question 2: I believe that murderers
should be executed. The responses of the 200 subjects are summarized in Table 28.8. Is there 
a statistical relationship between subjects' responses to the two questions? 


Table 28.8 Summary of Data for Example 28.6

                                     X variable (Question 1)
                                 0 (Disagree)    1 (Agree)     Row sums
Y variable      0 (Disagree)       a = 30         b = 70         100
(Question 2)    1 (Agree)          c = 60         d = 40         100
                Column sums          90             110       Total = 200


Subjects’ responses to Question 1 will represent the X variable, and their responses to 
Question 2 will represent the Y variable. The use of the tetrachoric correlation coefficient in 
evaluating the data is based on the assumption that the permissible responses Agree versus 
Disagree represent two points that lie on a continuous scale. It will be assumed that if subjects 
are allowed to present their opinions to the questions with more precision, their responses can 
be quantified on an interval/ratio scale, and that the overall distribution of these responses in the 
underlying population will be normal. Thus, the responses of 0 and 1 on each variable are the 
result of dichotomizing information that is based on an underlying interval/ratio scale. 

Equations 28.51 and 28.52 can be employed to compute reasonably good approximations 
of the value of r_tet. Lindeman et al. (1980) note that Equation 28.51 provides a good approximation
of r_tet when p_0 = p_1 = .5 for both of the dichotomous variables. In other words, for
the X variable both p_0X = (a + c)/n and p_1X = (b + d)/n will equal .5, and for the Y variable
both p_0Y = (a + b)/n and p_1Y = (c + d)/n will equal .5 (where n = a + b + c + d). Equation
28.52, on the other hand, is recommended when the values of p_0 and p_1 are not equal. As the
discrepancy between p_0 and p_1 increases, the less accurate the approximation provided by
Equation 28.52 becomes. 

In both Equations 28.51 and 28.52, a and d will always represent the frequencies of cells 
in which subjects provide the same response for both variables/questions, and b and c will 
always represent the frequencies of cells in which subjects provide opposite responses for the 
two variables/questions. Inspection of Table 28.8 (which is identical to Table 28.3, which is
employed to illustrate that φ is a special case of r) indicates the following: 1) a = 30 subjects
respond Disagree to both questions; 2) d = 40 subjects respond Agree to both questions; 
3) b = 70 subjects respond Agree to Question 1 and Disagree to Question 2; and 4) c = 60 
subjects respond Disagree to Question 1 and Agree to Question 2. 

The configuration of the data is such that p_0 = p_1 for the Y variable, and the relationship
is closely approximated for the X variable. Specifically, for the Y variable, p_0Y = (a + b)/n
= (30 + 70)/200 = .5 and p_1Y = (c + d)/n = (60 + 40)/200 = .5. In the case of the X variable,
p_0X = (a + c)/n = (30 + 60)/200 = .45 and p_1X = (b + d)/n = (70 + 40)/200 = .55.
The appropriate values are substituted in Equations 28.51 and 28.52 below. The trigonometric
functions in each of the equations can be easily calculated with one keystroke on most
scientific calculators.





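\[ r_{tet} = \sin\left[90^\circ\left(\frac{a + d - b - c}{n}\right)\right] = \sin\left[90^\circ\left(\frac{30 + 40 - 70 - 60}{200}\right)\right] = \sin(-27^\circ) = -.45 \]  (Equation 28.51)

\[ r_{tet} = \cos\left(\frac{180^\circ}{1 + \sqrt{\frac{ad}{bc}}}\right) = \cos\left(\frac{180^\circ}{1 + \sqrt{\frac{(30)(40)}{(70)(60)}}}\right) = \cos 117.30^\circ = -.46 \]  (Equation 28.52)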

Since for both variables the condition p_0 = p_1 = .5 is present or approximated, the two
equations result in almost identical values. The negative sign in front of the correlation coefficient
reflects the fact that subjects who are in one response category on one variable are more
likely to be in the other response category on the other variable. A positive correlation would
indicate that subjects tend to be in the same response category on both variables. Note that the
absolute value r_tet = .45 (or .46) is greater than the value φ = .30 obtained for the same set of
data. This is consistent with what was noted earlier: the value of r_tet computed for a 2 x 2
table will always be larger than the value of φ computed for the same data. The reader should
take note of the fact, however, that unlike r_tet, φ is always expressed as a positive number.
Thus, in comparing the two values for the same set of data, the absolute value of r_tet should be
employed.
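
The two approximations, together with φ for comparison, can be sketched in Python as shown below. The function names are illustrative only, and the cell labels follow the convention of Example 28.6 (a and d are the cells in which both responses agree). The degree-based trigonometric forms are converted to radians because Python's math functions expect radians. The phi function uses the usual 2 x 2 product formula, which is an assumption here since that equation is not reproduced in this part of the text.

```python
from math import sin, cos, sqrt, radians


def tetrachoric_sin(a, b, c, d):
    """Equation 28.51: sine approximation (suited to p0 = p1 = .5 on both variables)."""
    n = a + b + c + d
    return sin(radians(90.0 * (a + d - b - c) / n))


def tetrachoric_cos(a, b, c, d):
    """Equation 28.52: cosine approximation. Undefined when b * c equals zero."""
    return cos(radians(180.0 / (1.0 + sqrt((a * d) / (b * c)))))


def phi(a, b, c, d):
    """Signed phi coefficient for the same 2 x 2 table (standard product formula)."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Cell frequencies from Example 28.6.
a, b, c, d = 30, 70, 60, 40
r_sin = tetrachoric_sin(a, b, c, d)
r_cos = tetrachoric_cos(a, b, c, d)
phi_abs = abs(phi(a, b, c, d))  # the text reports phi as a positive number

print(f"{r_sin:.2f} {r_cos:.2f} {phi_abs:.2f}")  # -0.45 -0.46 0.30
```

Both approximations exceed φ in absolute value for these data, in line with the comparison made in the text.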


Test 28j-a: Test of significance for a tetrachoric correlation coefficient In order to 
evaluate the null hypothesis H_0: ρ_tet = 0, the standard error of estimate of r_tet must first be
computed employing Equation 28.53. In the latter equation, the values h_X and h_Y are the
height (ordinate) of the standard normal distribution at the point for each of the variables which
divides the proportions p_0 and p_1. The protocol for determining the ordinate for each of the
variables is identical to that employed for determining the ordinate for the biserial correlation
coefficient. Employing Table A1 for both the X and the Y variables, the z value is identified
which delineates the point on the normal curve that a proportion of cases corresponding to the
smaller of the two proportions p_0 versus p_1 falls above and the larger of the two proportions falls
below. The corresponding ordinate (in Column 4 of Table A1) is then determined. Thus, in the
case of the X variable, h_X = .3958 (since 45% of cases fall above the corresponding value
z = .128). (The value h_X = .3958 is interpolated from Table A1.) In the case of the Y variable,
h_Y = .3989 (since 50% of cases fall above the corresponding value z = 0). When the appropriate
values are substituted in Equation 28.53, the value s_{r_tet} = .111 is computed.

\[ s_{r_{tet}} = \frac{\sqrt{p_{0X}\,p_{1X}\,p_{0Y}\,p_{1Y}}}{h_X h_Y \sqrt{n}} = \frac{\sqrt{(.45)(.55)(.5)(.5)}}{(.3958)(.3989)\sqrt{200}} = .111 \]  (Equation 28.53)


The value s_{r_tet} = .111 is substituted in Equation 28.54, which is employed to evaluate
the null hypothesis H_0: ρ_tet = 0. Use of the normal distribution in evaluating the null hypothesis
assumes that the computation of r_tet is based on a large sample size (since, as noted earlier,
r_tet will be extremely unreliable if it is based on a small sample). Employing Equation 28.54,
the value z = -4.14 is computed. Note that the sign of z will always be the same as the sign of
r_tet.

\[ z = \frac{r_{tet}}{s_{r_{tet}}} = \frac{-.46}{.111} = -4.14 \]  (Equation 28.54)


It will be assumed that the nondirectional alternative hypothesis H_1: ρ_tet ≠ 0 is evaluated.
Employing Table A1, it is determined that the tabled critical two-tailed .05 and .01 values are z_.05 = 1.96
and z_.01 = 2.58. Since the obtained absolute value z = 4.14 is greater than both of the
aforementioned critical values, the nondirectional alternative hypothesis H_1: ρ_tet ≠ 0 is
supported at both the .05 and .01 levels. Additional discussion of the tetrachoric correlation 
coefficient can be found in Guilford (1965), Lindeman et al. (1980), and McNemar (1969). 
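
The standard error and z value for Example 28.6 can be reproduced with the following Python sketch. It assumes that Equation 28.53 has the form √(p_0X p_1X p_0Y p_1Y)/(h_X h_Y √n), which yields the reported value .111. The function names are hypothetical, and the ordinates are computed with statistics.NormalDist instead of being interpolated from Table A1.

```python
from math import sqrt
from statistics import NormalDist


def ordinate(p_small):
    """Height of the standard normal curve at the z value above which p_small of cases fall."""
    std_normal = NormalDist()
    return std_normal.pdf(std_normal.inv_cdf(1.0 - p_small))


def tetrachoric_se(p0x, p1x, p0y, p1y, n):
    """Standard error of r_tet (assumed form of Equation 28.53)."""
    h_x = ordinate(min(p0x, p1x))
    h_y = ordinate(min(p0y, p1y))
    return sqrt(p0x * p1x * p0y * p1y) / (h_x * h_y * sqrt(n))


se = tetrachoric_se(p0x=0.45, p1x=0.55, p0y=0.5, p1y=0.5, n=200)
z = -0.46 / se  # Equation 28.54, using the rounded r_tet = -.46 from the text

print(f"h_X = {ordinate(0.45):.4f}, s = {se:.3f}")  # h_X = 0.3958, s = 0.111
print(abs(z) > 2.58)                                # True
```

Carrying the unrounded standard error through the division gives a z value of about -4.13 rather than the -4.14 obtained with the rounded value .111. The difference is due to rounding only and does not affect the decision.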


2. Multiple regression analysis This section of the Addendum will describe the following 
measures: a) The multiple correlation coefficient (Test 28k); b) The partial correlation 
coefficient (Test 28l); and c) The semi-partial correlation coefficient (Test 28m). The use of
the term multivariate implies that data for three or more variables are employed in the analysis. 
The measures that are described are extensions of the Pearson product-moment correlation 
coefficient to an analysis involving three or more variables. All of the measures (each of which 
is discussed in reference to an analysis involving three variables) assume that all of the variables 
are measured on an interval/ratio scale. A brief description of the measures to be described 
follows. 

The multiple correlation coefficient (Test 28k) The multiple correlation coefficient 
(R) is a correlation between a criterion variable and a linear combination of two or more predictor
variables. 

The partial correlation coefficient (Test 28l) The partial correlation coefficient (e.g.,
r_{YX_1.X_2}) measures the degree of association between two variables, after any linear association
one or more additional variables has with the two variables has been removed. 

The semi-partial correlation coefficient (Test 28m) The semi-partial correlation coefficient
(or part correlation coefficient) (e.g., r_{Y(X_1.X_2)}) measures the degree of association
between two variables, with the linear association of one or more other variables removed from 
only one of the two variables that are being correlated with one another. 


General introduction to multiple regression analysis Multiple regression analysis is 
the term employed to describe the use of correlation and regression with designs involving more 
than two variables. Such analysis, which is considerably more complex than bivariate analysis, 
involves laborious computational procedures that make it all but impractical to conduct without 
the aid of a computer. Although this section will provide the reader with an overview of multiple
regression analysis, the discussion to follow is not intended to provide comprehensive coverage
of the subject. For a more thorough discussion of multiple regression, the reader should consult
sources on multivariate analysis (e.g., Marascuilo and Levin (1983), Stevens (1986, 1996), and 
Tabachnick and Fidell (1989, 1996)). 

In contrast to simple linear regression, where scores on one predictor variable are employed 
to predict the scores on a criterion variable, in multiple regression analysis a researcher attempts 
to increase the accuracy of prediction through the use of multiple predictor variables. By em- 
ploying multiple predictor variables, one can often account for a greater amount of the variability 
on the criterion variable than will be the case if only one of the predictor variables is employed. 
Thus, the major goal of multiple regression analysis is to identify a limited number of predictor 
variables which optimize one's ability to predict scores on a criterion variable. 

Since researchers generally want the simplest possible predictive model (as well as the fact 
that from a time and cost perspective, a model that involves a limited number of variables is less 
costly and easier to implement), it is unusual to find a multiple regression model that involves 
more than five predictor variables. Two additional factors which limit the number of predictor 
variables derived in multiple regression analysis follow: a) Once a limited number of predictor 
variables has been identified that explains a relatively large proportion of the variability on the 
criterion variable, it becomes increasingly unlikely that any new predictor variables which are 
identified will result in a significant increase in predictive power; and b) Although the researcher 
wants to identify predictor variables that are highly correlated with the criterion variable, he also 
wants to make sure that the predictor variables employed account for different proportions of the 
variability on the criterion variable. In order to accomplish the latter, none of the predictor 
variables should be highly correlated with one another, since, if the latter is true, the variables 
will be redundant with respect to the variation on the criterion variable they explain. As a
general rule, it is difficult to find a large number of predictor variables that are highly correlated 
with a criterion variable, yet not correlated with one another. The term multicollinearity is used 
to describe a situation where predictor variables have a high intercorrelation with one another. 
When multicollinearity exists, the reliability of multiple regression analysis may be severely 
compromised. 

Within the framework of multiple regression analysis there are a variety of strategies that 
are employed in selecting predictor variables. Among the strategies that are available are the 
following: a) Forward selection — In forward selection, predictor variables are determined 
one at a time with respect to the order of their contribution in explaining variability on the 
criterion variable; b) Backward selection — In backward selection, the researcher starts with 
a large pool of predictor variables and, starting with the smallest contributor, eliminates them one
at a time, based on whether or not they make a significant contribution to the predictive model; 
c) Stepwise regression — Stepwise regression is a combination of the forward and backward 
selection methods. In stepwise regression, one employs the forward selection method, but upon 
adding each new predictor variable, all of the remaining predictor variables from the original 
pool are reexamined to determine whether they should be retained in the predictive model; d) 
Hierarchical regression — Whereas in the three previous methods statistical considerations 
dictate which predictor variables are included in the model, in hierarchical regression the 
researcher determines which variables should be included in the model. The latter determination 
is based on such factors as logic, theory, results of prior research, and cost; and e) Standard or 
direct regression — In standard/direct regression, all available predictor variables are 
included in the model, including those that only explain a minimal amount of variability on the 
criterion variable. This type of regression may be employed when a researcher wants to explore, 
for theoretical or other reasons, the relationship between a large set of predictor variables and a 
criterion variable. Standard/direct regression is atypical when compared with the other methods 
of regression analysis, in that it is more likely to result in a large number of predictor variables.

Upon conducting a multiple regression analysis, it is recommended that the resulting
regression model be cross-validated. Minimally, this means replicating the results of the analysis 
on two subsamples, each representing a different half of the original sample. An even more de- 
sirable strategy is replicating the results on one or more independent samples that are repre- 
sentative of the population to which one wishes to apply the model. By cross-validating a model, 
one can demonstrate that it generates consistent results, and will thus be of practical value in 
making predictions among members of the reference population upon which the model is based. 

Although multiple regression analysis may result in a mechanism for reasonably accurate 
predictions, it does not provide sufficient control over the variables under study to allow a 
researcher to draw conclusions with regard to cause and effect. As is the case with bivariate 
correlation, multivariate correlation is not immune to the potential impact of extraneous variables 
that may be critical in understanding the causal relationship between the variables under study. 
In order to demonstrate cause and effect, one must employ the experimental method (specifically, 
through use of the true experiment, demonstrate a causal connection between scores on a 
dependent variable and a manipulated independent variable). It should be noted that there is a 
procedure called path analysis (which will not be described in this book) that employs cor- 
relational information to evaluate causal relationships between variables. Statisticians are not 
in agreement, however, with respect to what role path analysis (as well as a number of related 
procedures) should play in making judgements with regard to cause and effect in correlational 
research. 

Multiple regression analysis has a number of assumptions which if violated can compromise 
the reliability of the results. These assumptions (which concern normality, linearity, homosce- 
dasticity, and independence of the residuals), as well as the impact of outliers on a multiple 
regression analysis, are discussed in books that provide comprehensive coverage of the subject. 


Computational procedures for multiple regression analysis involving three variables 


Test 28k: The multiple correlation coefficient Within the framework of multiple regression 
analysis, a researcher is able to compute a correlation coefficient between the criterion variable 
and the k predictor variables (where k ≥ 2). The computed multiple correlation coefficient
is represented by the notation R. A computed value of R must fall within the range 0 to +1 (i.e.,
0 ≤ R ≤ +1). Unlike an r value computed for two variables, the multiple correlation coefficient
cannot be a negative number. The closer the value of R is to 1, the stronger the linear
relationship between the criterion variable and the k predictor variables, whereas the closer it is
to 0, the weaker the linear relationship. Because of the complexity of the computations involved,
the discussion of the computational procedures for multiple regression analysis will be restricted
to designs involving two predictor variables (which in the examples to be discussed will be designated
X_1 and X_2), and a criterion variable (which will be designated Y). When k = 2, Equation
28.55 is employed to compute the value of R. The notation R_{Y.X_1X_2} represents the multiple
correlation coefficient between the criterion variable Y and the linear combination of two
predictor variables X_1 and X_2.





\[ R_{Y.X_1X_2} = \sqrt{\frac{r_{YX_1}^2 + r_{YX_2}^2 - 2\,r_{YX_1}\,r_{YX_2}\,r_{X_1X_2}}{1 - r_{X_1X_2}^2}} \]  (Equation 28.55)


The following example will be employed to illustrate the use of Equation 28.55. Assume 
that the following correlation coefficients have been computed: a) The correlation between the
number of ounces of sugar a child eats (which will represent predictor variable X_1) and the
number of cavities (which will represent the criterion variable Y) is r_{YX_1} = .955; b) The
correlation between the number of ounces of salt a child eats (which will represent predictor
variable X_2) and the number of cavities is r_{YX_2} = .52; and c) The correlation between the number
of ounces of sugar a child eats and the number of ounces of salt a child eats is r_{X_1X_2} = .37.
The three above noted correlations are based on the following data: a) Table 28.1 lists the
number of ounces of sugar a child eats (X_1) and the number of cavities (Y); and b) The following
values, which are used to compute the correlations r_{YX_2} = .52 and r_{X_1X_2} = .37, are employed
to represent the number of ounces of salt eaten (X_2) by the five subjects in Table 28.1: 4, 1, 1,
3, 6. Employing the latter set of five scores, the mean and estimated population standard
deviation for variable X_2 are X̄_2 = 15/5 = 3 and s̃_{X_2} = √([63 - ((15)²/5)]/4) = 2.12.

Substituting the correlations r_{YX_1} = .955, r_{YX_2} = .52, and r_{X_1X_2} = .37 in Equation 28.55,
the multiple correlation coefficient R_{Y.X_1X_2} = .972 is computed.

\[ R_{Y.X_1X_2} = \sqrt{\frac{(.955)^2 + (.52)^2 - 2(.955)(.52)(.37)}{1 - (.37)^2}} = \sqrt{.944} = .972 \]

Note that the value R_{Y.X_1X_2} = .972 is larger than either value that is computed when each
of the predictor variables is correlated separately with the criterion variable. Of course, the value
of R_{Y.X_1X_2} can be only minimally above r_{YX_1} = .955, since the maximum value R can attain is
1.
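
For readers who wish to verify the arithmetic, the following Python sketch evaluates Equation 28.55 from the three bivariate correlations and also reproduces the mean and estimated population standard deviation of the salt scores. The function name multiple_r is illustrative and does not come from the original text.

```python
from math import sqrt
from statistics import mean, stdev


def multiple_r(r_y1, r_y2, r_12):
    """Equation 28.55: multiple correlation of Y with two predictors X1 and X2."""
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return sqrt(r_squared)


# Salt scores (X2) for the five subjects, as listed in the text.
salt = [4, 1, 1, 3, 6]
# statistics.stdev divides by n - 1, i.e. it is the estimated population value.
print(f"mean = {mean(salt):.2f}, sd = {stdev(salt):.2f}")  # mean = 3.00, sd = 2.12

R = multiple_r(r_y1=0.955, r_y2=0.52, r_12=0.37)
print(f"R = {R:.3f}, R squared = {R ** 2:.3f}")            # R = 0.972, R squared = 0.944
```

The formula requires that the two predictors not be perfectly correlated, since the denominator 1 - r² would then equal zero.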


The coefficient of multiple determination R², which is the square of the multiple correlation
coefficient, is referred to as the coefficient of multiple determination. The coefficient of
multiple determination indicates the proportion of variance on the criterion variable that can
be accounted for on the basis of variability on the k predictor variables. In our example
R²_{Y.X_1X_2} = (.972)² = .944. In point of fact, R² is a biased measure of P², which is the population
parameter it is employed to estimate (P is the upper case Greek letter rho). The degree to which
the computed value of R² is a biased estimate of P² will be a function of the sample size and the
number of predictor variables employed in the analysis. The value of R² will be spuriously
inflated when the sample size is close in value to the number of predictor variables employed in
the analysis. For this reason, sources emphasize that the number of subjects employed in a
multiple regression analysis should always be substantially larger than the number of predictor
variables. For example, Marascuilo and Levin (1983) recommend that the value of n should be
at least ten times the value of k.

One way of correcting for bias resulting from a small sample size is to employ Equation
28.56 to compute R̃², which is a relatively unbiased estimate of P². The value R̃² is commonly
referred to as a "shrunken" estimate of the coefficient of multiple determination.


\[ \tilde{R}^2 = 1 - \left[\frac{(1 - R^2)(n - 1)}{n - k - 1}\right] \]  (Equation 28.56)

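
Equation 28.56 translates directly into a short function. The Python sketch below is illustrative only (the function name is hypothetical) and applies the equation to the values of the cavities example, in which n = 5, k = 2, and R² = .944.

```python
def shrunken_r_squared(r_squared, n, k):
    """Equation 28.56: shrunken estimate of the coefficient of multiple determination."""
    if n - k - 1 <= 0:
        raise ValueError("n must be larger than k + 1")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


adjusted = shrunken_r_squared(0.944, n=5, k=2)
print(f"{adjusted:.2f}")  # 0.89

# With the same R squared, the correction shrinks toward zero as n grows.
print(f"{shrunken_r_squared(0.944, n=50, k=2):.3f}")
```

The second call illustrates the point made in the text: the amount of shrinkage depends on how close the sample size is to the number of predictor variables.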
It can be seen below that substituting the values from our example in Equation 28.56 yields
the value R̃²_{Y.X_1X_2} = .89, which is lower than R²_{Y.X_1X_2} = .944 computed with Equation 28.55
(R²_{Y.X_1X_2} = .944 is the value in the radical of Equation 28.55 prior to computing the square root).

\[ \tilde{R}^2_{Y.X_1X_2} = 1 - \left[\frac{(1 - .944)(5 - 1)}{5 - 2 - 1}\right] = .89 \]

Test 28k-a: Test of significance for a multiple correlation coefficient Equation 28.57 is
employed to evaluate the null hypothesis H_0: P² = 0. If the latter null hypothesis is supported,
the value of P will also equal zero and, consequently, the null hypothesis H_0: P = 0 is also
supported.


\[ F = \frac{(n - k - 1)R^2}{k(1 - R^2)} \]  (Equation 28.57)

The computed value R²_{Y.X_1X_2} = .944, as well as the shrunken estimate R̃²_{Y.X_1X_2} = .89, is
substituted in Equation 28.57 below. If one has to choose which of the two values to employ,
researchers would probably consider it more prudent to employ the shrunken estimate R̃²_{Y.X_1X_2}
= .89 (especially when the sample size is small).


\[ F = \frac{(5 - 2 - 1)(.944)}{2(1 - .944)} = 16.86 \]

\[ F = \frac{(5 - 2 - 1)(.89)}{2(1 - .89)} = 8.09 \]


The computed F value is evaluated with Table A10 (Table of the F Distribution) in the 
Appendix. In order to reject the null hypothesis, the F value must be equal to or greater than the 
tabled critical value at the prespecified level of significance. The degrees of freedom employed 
for the analysis are $df_{num} = k$ and $df_{den} = n - k - 1$. Thus, for our example $df_{num} = 2$ and 
$df_{den} = 5 - 2 - 1 = 2$. In Table A10, for $df_{num} = 2$ and $df_{den} = 2$, the tabled critical .05 and 
.01 values are $F_{.05} = 19.00$ and $F_{.01} = 99.00$. Since both of the obtained values F = 16.86 and 
F = 8.09 are less than $F_{.05} = 19.00$, regardless of whether $R^2$ or $\tilde{R}^2$ is employed, the null 
hypothesis cannot be rejected. Thus, in spite of the fact that the obtained value of R is close to 
1, the data still do not allow one to conclude that the population multiple correlation coefficient 
is some value other than zero. The lack of significance for such a large R value can be attributed 
to the small sample size. 
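
As a rough illustration, Equations 28.56 and 28.57 can be sketched in Python as follows. The function names are hypothetical, and the closed-form tail probability is an added convenience that holds only for the special case of two numerator degrees of freedom (as in this example); in other cases a table or statistical software would be needed.

```python
def shrunken_r2(r2, n, k):
    """Equation 28.56: shrunken estimate of the coefficient of multiple determination."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_ratio(r2, n, k):
    """Equation 28.57: F ratio for the null hypothesis that the population value is zero."""
    return (n - k - 1) * r2 / (k * (1 - r2))


def upper_tail_p_df_num_2(f, df_den):
    """Upper-tail probability of F; valid only when df_num = 2 (assumed special case)."""
    return (1 + 2 * f / df_den) ** (-df_den / 2)


n, k = 5, 2                      # five subjects, two predictor variables
r2 = 0.944                       # value reported in the text
r2_shrunk = shrunken_r2(r2, n, k)

F_CRIT_05 = 19.00                # tabled critical value quoted in the text (df = 2, 2)
for label, value in (("R^2", r2), ("shrunken R^2", round(r2_shrunk, 2))):
    f = f_ratio(value, n, k)
    p = upper_tail_p_df_num_2(f, n - k - 1)
    print(f"{label}: F = {f:.2f}, p = {p:.3f}, significant at .05: {f >= F_CRIT_05}")
```

Neither F value reaches the tabled critical value, which agrees with the conclusion stated above.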


The multiple regression equation A major goal of multiple regression analysis is to derive a 
multiple regression equation that utilizes scores on the k predictor variables to predict the 
scores of subjects on the criterion variable. Equation 28.58 is the general form of the multiple 
regression equation. 


$$Y' = a + b_1X_1 + b_2X_2 + \cdots + b_kX_k \qquad \text{(Equation 28.58)}$$


Note that the multiple regression equation contains a regression coefficient ($b_i$) for each 
of the predictor variables ($X_1$, $X_2$, ..., $X_k$) and a regression constant (a). In contrast to the 
regression line employed in simple linear regression, the multiple regression equation describes 
a regression plane that provides the best fit through a set of data points that exists in a multi- 
dimensional space. The values computed for the regression equation minimize the sum of the 
squared residuals, which in the case of multiple regression are the sum of the squared distances 
of all the data points from the regression plane. 

Equations 28.59 and 28.60 can be employed to determine the values of the regression 
coefficients $b_1$ and $b_2$, which are the coefficients for predictor variables $X_1$ and $X_2$. Each of the 
regression coefficients indicates the amount of change on the criterion variable Y that will be 
associated with a one unit change on that predictor variable, if the effect of the second predictor 
variable is held constant. 

$$b_1 = \left[\frac{\tilde{s}_Y}{\tilde{s}_{X_1}}\right]\left[\frac{r_{YX_1} - r_{YX_2}r_{X_1X_2}}{1 - r^2_{X_1X_2}}\right] \qquad \text{(Equation 28.59)}$$

$$b_2 = \left[\frac{\tilde{s}_Y}{\tilde{s}_{X_2}}\right]\left[\frac{r_{YX_2} - r_{YX_1}r_{X_1X_2}}{1 - r^2_{X_1X_2}}\right] \qquad \text{(Equation 28.60)}$$


Equation 28.61 is employed to compute the regression constant a (which is analogous to 
the Y intercept computed in simple linear regression). 


$$a = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2 \qquad \text{(Equation 28.61)}$$


The multiple regression equation will now be computed. From earlier discussion we 
know the following values: $\bar{X}_1 = 7.2$, $\tilde{s}_{X_1} = 8.58$, $\bar{X}_2 = 3$, $\tilde{s}_{X_2} = 2.12$, $\bar{Y} = 3.4$, 
$\tilde{s}_Y = 2.70$, $r_{YX_1} = .955$, $r_{YX_2} = .52$, $r_{X_1X_2} = .37$. Substituting the appropriate values in Equations 
28.59-28.61, the multiple regression equation is determined below. 

$$b_1 = \left[\frac{2.70}{8.58}\right]\left[\frac{.955 - (.52)(.37)}{1 - (.37)^2}\right] = .278$$

$$b_2 = \left[\frac{2.70}{2.12}\right]\left[\frac{.52 - (.955)(.37)}{1 - (.37)^2}\right] = .246$$

$$a = 3.4 - (.278)(7.2) - (.246)(3) = .660$$

$$Y' = .660 + .278X_1 + .246X_2$$

To illustrate the application of the multiple regression equation, when the appropriate values 
are substituted in Equation 28.58, a child who consumes 4 ounces of sugar ($X_1$) and 2 ounces 
of salt ($X_2$) per week is predicted to have 2.264 cavities. 

$$Y' = .660 + (.278)(4) + (.246)(2) = 2.264$$
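
The computation above can be reproduced from the summary statistics alone. The following Python sketch is illustrative only: the function name and argument names are hypothetical, and because it carries full precision rather than rounding at each step, its results agree with the values in the text to about three decimal places.

```python
def two_predictor_equation(mean_y, s_y, mean_x1, s_x1, mean_x2, s_x2, r_y1, r_y2, r_12):
    """Return (a, b1, b2) from summary statistics via Equations 28.59-28.61."""
    shared = 1 - r_12 ** 2
    b1 = (s_y / s_x1) * (r_y1 - r_y2 * r_12) / shared
    b2 = (s_y / s_x2) * (r_y2 - r_y1 * r_12) / shared
    a = mean_y - b1 * mean_x1 - b2 * mean_x2
    return a, b1, b2


# Summary statistics quoted in the text for the sugar/salt/cavities example.
a, b1, b2 = two_predictor_equation(
    mean_y=3.4, s_y=2.70,
    mean_x1=7.2, s_x1=8.58,
    mean_x2=3.0, s_x2=2.12,
    r_y1=0.955, r_y2=0.52, r_12=0.37,
)

# Predicted cavities for 4 ounces of sugar and 2 ounces of salt (Equation 28.58).
y_pred = a + b1 * 4 + b2 * 2
print(round(a, 3), round(b1, 3), round(b2, 3), round(y_pred, 3))
```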


The standard error of multiple estimate As is the case with simple linear regression, a standard 
error of estimate can be computed which can be employed to determine how accurately the 
multiple regression equation will predict a subject’s score on the criterion variable. Employing 
this error term, which in the case of multiple regression is referred to as the standard error of 
multiple estimate, one can compute a confidence interval for a predicted score. The standard 
error of multiple estimate will be represented by the notation $s_{Y.X_1X_2}$. Equation 28.62 is 
employed to compute $s_{Y.X_1X_2}$ if $R^2$ (the biased estimate of $P^2$) is used to represent the 
coefficient of multiple determination. If, on the other hand, the unbiased estimate $\tilde{R}^2$ is 
employed to represent the coefficient of multiple determination, Equation 28.63 can be used 
to compute $s_{Y.X_1X_2}$. The two equations are employed below to compute the value $s_{Y.X_1X_2} = .90$. 


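$$s_{Y.X_1X_2} = \tilde{s}_Y\sqrt{\frac{(1 - R^2_{Y.X_1X_2})(n - 1)}{n - k - 1}} = (2.70)\sqrt{\frac{(1 - .944)(5 - 1)}{5 - 2 - 1}} = .90 \qquad \text{(Equation 28.62)}$$

$$s_{Y.X_1X_2} = \tilde{s}_Y\sqrt{1 - \tilde{R}^2_{Y.X_1X_2}} = (2.70)\sqrt{1 - .89} = .90 \qquad \text{(Equation 28.63)}$$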


Computation of a confidence interval for Y' Equation 28.64 (which is analogous to Equation 
28.16) can be employed to compute a confidence interval for the predicted value Y'. The value 
$s_{Y.X_1X_2} = .90$ is employed in Equation 28.64 to compute the 95% confidence interval for the 
subject who is predicted to have Y' = 2.264 cavities. Also employed in the latter equation is the 
tabled critical two-tailed .05 t value $t_{.05} = 4.30$ (for df = n - k - 1 = 5 - 2 - 1 = 2), which 
delineates the 95% confidence interval. 


$$CI_{(1 - \alpha)} = Y' \pm (t_{\alpha/2})(s_{Y.X_1X_2}) = 2.264 \pm (4.30)(.90) = 2.264 \pm 3.87 \qquad \text{(Equation 28.64)}$$

This result indicates that the researcher can be 95% confident (or the probability is .95) that 
the number of cavities the subject actually has falls within the range -1.606 and 6.134 (i.e., 
$-1.606 \leq Y \leq 6.134$). Since a person cannot have a negative number of cavities, the latter result 
indicates that the person will have between 0 and 6.134 cavities. 
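
The standard error of multiple estimate and the resulting interval can be checked with a short Python sketch. The function names are hypothetical, and the critical t value is simply the tabled value quoted in the text rather than something computed here.

```python
import math


def se_multiple_estimate(s_y, r2, n, k):
    """Equation 28.62: standard error of multiple estimate from the biased R^2."""
    return s_y * math.sqrt((1 - r2) * (n - 1) / (n - k - 1))


def confidence_interval_for_prediction(y_pred, t_crit, se):
    """Equation 28.64: predicted value plus or minus t times the standard error."""
    margin = t_crit * se
    return y_pred - margin, y_pred + margin


se = se_multiple_estimate(s_y=2.70, r2=0.944, n=5, k=2)

# The text rounds the standard error to .90 before building the interval,
# so the same rounding is applied here to reproduce its limits.
T_CRIT_05 = 4.30                 # tabled two-tailed .05 value for df = 2
low, high = confidence_interval_for_prediction(2.264, T_CRIT_05, round(se, 2))
print(round(se, 4), round(low, 3), round(high, 3))
```

Without the intermediate rounding the limits shift slightly in the second decimal place, which does not change the interpretation.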


Evaluation of the relative importance of the predictor variables If the result of a multiple 
regression analysis is significant, a researcher will want to assess the relative importance of the 
predictor variables in explaining variability on the criterion variable. It should be noted that 
although the value computed for the multiple correlation coefficient is not significant for the 
example under discussion, within the framework of the discussion of the material in this section, 
it will be assumed that it is. Intuitively, it might appear that one can evaluate the relative impor- 
tance of the predictor variables based on the relative magnitude of the regression coefficients. 
However, since the different predictor variables represent different units of measurement, com- 
parison of the regression coefficients will not allow a researcher to make such an estimate. One 
approach to solving this problem is to standardize each of the variables so that scores on all of 
the variables are based on standard normal distributions. As a result of standardizing all of the 
variables, Equation 28.65 (which is referred to as the standardized multiple regression 
equation) becomes the general form of the multiple regression equation. 


$$z_{Y'} = \beta_1 z_{X_1} + \beta_2 z_{X_2} + \cdots + \beta_k z_{X_k} \qquad \text{(Equation 28.65)}$$


In Equation 28.65 the predicted value Y', as well as the scores on the predictor variables 
$X_1$ and $X_2$, are expressed as standard deviation scores (i.e., z scores). The standardized 
equivalent of a regression coefficient, referred to as a beta weight, is represented by the notation 
$\beta_i$. Since the regression constant a will always equal zero in the standardized multiple regression 
equation, it is not included. When there are two predictor variables, Equations 28.66 and 
28.67 can be employed to compute the values of $\beta_1$ and $\beta_2$. Note that each equation is expressed 
in two equivalent forms, one form employing the regression coefficients and the relevant 
estimated population standard deviations, and the other form employing the correlations between 
the three variables. 


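$$\beta_1 = b_1\left[\frac{\tilde{s}_{X_1}}{\tilde{s}_Y}\right] = \frac{r_{YX_1} - r_{YX_2}r_{X_1X_2}}{1 - r^2_{X_1X_2}} \qquad \text{(Equation 28.66)}$$

$$\beta_2 = b_2\left[\frac{\tilde{s}_{X_2}}{\tilde{s}_Y}\right] = \frac{r_{YX_2} - r_{YX_1}r_{X_1X_2}}{1 - r^2_{X_1X_2}} \qquad \text{(Equation 28.67)}$$
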


Substituting the appropriate values in Equations 28.66 and 28.67, the values $\beta_1 = .883$ and 
$\beta_2 = .193$ are computed. 

$$\beta_1 = (.278)\left[\frac{8.58}{2.70}\right] = \frac{.955 - (.52)(.37)}{1 - (.37)^2} = .883$$

$$\beta_2 = (.246)\left[\frac{2.12}{2.70}\right] = \frac{.52 - (.955)(.37)}{1 - (.37)^2} = .193$$

Thus, the standardized multiple regression equation is as follows: $z_{Y'} = .883z_{X_1} + .193z_{X_2}$. 


The value $\beta_1 = .883$ indicates that an increase of one standard deviation unit on variable $X_1$ 
is associated with an increase of .883 standard deviation units on the criterion variable (if the 
predictor variable $X_2$ remains at a fixed value). In the same respect the value $\beta_2 = .193$ indicates 
that an increase of one standard deviation unit on variable $X_2$ is associated with an increase 
of .193 standard deviation units on the criterion variable (if the predictor variable $X_1$ remains 
at a fixed value). 

When there are two predictor variables, Equation 28.68 employs the standardized beta 
weights to provide an alternative method for computing the value $R^2_{Y.X_1X_2}$. The value derived 
with Equation 28.68 will be equivalent to the square of the value computed with Equation 28.55. 

$$R^2_{Y.X_1X_2} = \beta_1 r_{YX_1} + \beta_2 r_{YX_2} \qquad \text{(Equation 28.68)}$$

$$R^2_{Y.X_1X_2} = (.883)(.955) + (.193)(.52) = .944$$
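
A brief Python sketch of Equations 28.66 through 28.68 follows. The names are hypothetical; the two forms of each beta weight differ slightly in the third or fourth decimal place only because the raw score coefficients quoted in the text are already rounded.

```python
def beta_weights(r_y1, r_y2, r_12):
    """Correlation form of Equations 28.66 and 28.67."""
    shared = 1 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / shared
    beta2 = (r_y2 - r_y1 * r_12) / shared
    return beta1, beta2


r_y1, r_y2, r_12 = 0.955, 0.52, 0.37
beta1, beta2 = beta_weights(r_y1, r_y2, r_12)

# Equivalent form using the (rounded) raw score coefficients and standard deviations.
beta1_from_b = 0.278 * 8.58 / 2.70
beta2_from_b = 0.246 * 2.12 / 2.70

# Equation 28.68: coefficient of multiple determination from the beta weights.
r2 = beta1 * r_y1 + beta2 * r_y2
print(round(beta1, 3), round(beta1_from_b, 3))
print(round(beta2, 3), round(beta2_from_b, 3))
print(round(r2, 3))
```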


Some sources note that the absolute values of the beta weights reflect the rank-ordering of 
the predictor variables with respect to the role they play in accounting for variability on the 
criterion variable. Kachigan (1986) suggests that by dividing the square of a larger beta weight 
by the square of a smaller beta weight, a researcher can determine the relative influence of two 
predictor variables on the criterion variable. The problem with the latter approach (as Kachigan 
(1986) himself notes) is that beta weights do not allow a researcher to separate the joint contri- 
bution of two or more predictor variables. The fact that predictor variables are usually correlated 
with one another (as is the case in the example under discussion), makes it difficult to determine 
how much variability on the criterion variable can be accounted for by any single predictor 
variable in and of itself. In view of this, Howell (1992, 1997) and Marascuilo and Serlin (1988) 
note that statisticians are not in agreement with respect to what, if any, methodology is 
appropriate for determining the precise amount of variability attributable to each of the predictor 
variables. 


Evaluating the significance of a regression coefficient In conducting a multiple regression 
analysis it is common practice to evaluate whether each of the regression coefficients is statistically 
significant. Since the unstandardized/raw score coefficients and standardized coefficients 
are linear transformations of one another, a statistical test on either set of coefficients will yield 
the same result. The null hypothesis that is evaluated is that the true value of the regression 
coefficient in the population equals zero. Thus, $H_0$: $B_i = 0$ (where $B_i$, which is the upper case Greek 
letter beta, represents the value of the coefficient for the $i$th predictor variable in the underlying 
population). For the purpose of discussion it will be assumed that the aforementioned null 
hypothesis is stated in reference to the unstandardized coefficients. In order to evaluate the null 
hypothesis $H_0$: $B_i = 0$, a standard error of estimate must be computed for a coefficient. When 
there are two predictor variables, the standard error of estimate for an unstandardized coefficient 
(represented by the notation $s_{b_i}$) is computed with Equation 28.69. 





$$s_{b_i} = \frac{\tilde{s}_Y}{\tilde{s}_{X_i}}\sqrt{\frac{1 - R^2_{Y.X_1X_2}}{(1 - r^2_{X_1X_2})(n - k - 1)}} \qquad \text{(Equation 28.69)}$$


Employing Equation 28.69, the values $s_{b_1} = .057$ and $s_{b_2} = .229$ are computed. 

$$s_{b_1} = \frac{2.70}{8.58}\sqrt{\frac{1 - .944}{[1 - (.37)^2][5 - 2 - 1]}} = .057$$

$$s_{b_2} = \frac{2.70}{2.12}\sqrt{\frac{1 - .944}{[1 - (.37)^2][5 - 2 - 1]}} = .229$$

Each of the values $s_{b_1} = .057$ and $s_{b_2} = .229$ can be substituted in Equation 28.70, 
which employs the t distribution to evaluate the null hypothesis $H_0$: $B_i = 0$. In the analysis 
to be described it will be assumed that the nondirectional alternative hypothesis $H_1$: $B_i \neq 0$ is 
evaluated for each regression coefficient. The degrees of freedom employed in evaluating a t 
value computed with Equation 28.70 are df = n - k - 1. Thus, df = 5 - 2 - 1 = 2. If the null 
hypothesis cannot be rejected for a specific coefficient, the researcher can conclude that the 
predictor variable in question will not be of any use in predicting scores on the criterion variable. 

$$t_{b_i} = \frac{b_i}{s_{b_i}} \qquad \text{(Equation 28.70)}$$

Employing Equation 28.70, the null hypotheses $H_0$: $B_1 = 0$ and $H_0$: $B_2 = 0$ are evaluated. 

$$t_{b_1} = \frac{.278}{.057} = 4.88 \qquad t_{b_2} = \frac{.246}{.229} = 1.07$$


Employing Table A2, for df = 2, the tabled critical .05 and .01 t values are $t_{.05} = 4.30$ 
and $t_{.01} = 9.93$. Since the value $t_{b_1} = 4.88$ is greater than $t_{.05} = 4.30$, the nondirectional 
alternative hypothesis $H_1$: $B_1 \neq 0$ is supported at the .05 level. It is not supported at the .01 
level, since $t_{b_1} = 4.88$ is less than $t_{.01} = 9.93$. Since $t_{b_2} = 1.07$ is less than $t_{.05} = 4.30$, the 
nondirectional alternative hypothesis $H_1$: $B_2 \neq 0$ is not supported. Thus, we can conclude that, 
whereas predictor variable $X_1$ (sugar consumption) contributes significantly in predicting variability 
on the criterion variable (number of cavities), predictor variable $X_2$ (salt consumption) 
does not. Consequently, the latter predictor variable can be removed from the analysis. It should 
be noted that when a researcher elects to eliminate a predictor variable from the analysis, a new 
regression equation should be derived which just involves the data for the remaining predictor 
variable(s). 

Computation of a confidence interval for a regression coefficient Equation 28.71 can be 
employed to compute a confidence interval for a regression coefficient. 

$$CI_{b_i(1 - \alpha)} = b_i \pm (t_{\alpha/2})(s_{b_i}) \qquad \text{(Equation 28.71)}$$

Using Equation 28.71, the 95% confidence intervals for the two regression coefficients are 
computed. The t value employed in Equation 28.71 is $t_{.05} = 4.30$, which is the tabled critical 
two-tailed .05 t value for df = n - k - 1 = 5 - 2 - 1 = 2. 

$$CI_{b_1} = .278 \pm (4.30)(.057) = .278 \pm .245$$

$$CI_{b_2} = .246 \pm (4.30)(.229) = .246 \pm .985$$


Thus, the ranges of values within which the researcher can be 95% confident (or the probability 
is .95) that the true values of the coefficients lie are as follows: $.033 \leq B_1 \leq .523$ and 
$-.739 \leq B_2 \leq 1.231$. 
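
The standard errors, t values, and confidence intervals for the two coefficients can be sketched together in Python. The names below are hypothetical, the function is written only for the two-predictor case covered by Equation 28.69, and the critical t value is the tabled value quoted in the text.

```python
import math


def se_coefficient(s_y, s_x, r2, r_12, n, k=2):
    """Equation 28.69: standard error of an unstandardized coefficient (two predictors)."""
    return (s_y / s_x) * math.sqrt((1 - r2) / ((1 - r_12 ** 2) * (n - k - 1)))


def coefficient_interval(b, se_b, t_crit):
    """Equation 28.71: b plus or minus t times its standard error."""
    margin = t_crit * se_b
    return b - margin, b + margin


T_CRIT_05 = 4.30                 # tabled two-tailed .05 value for df = 2
results = {}
for name, b, s_x in (("b1", 0.278, 8.58), ("b2", 0.246, 2.12)):
    # Round to three places, as the text does, before computing t and the interval.
    se_b = round(se_coefficient(2.70, s_x, 0.944, 0.37, n=5), 3)
    t = b / se_b                 # Equation 28.70
    results[name] = (se_b, t, coefficient_interval(b, se_b, T_CRIT_05))
    print(name, se_b, round(t, 2), abs(t) >= T_CRIT_05)
```

The interval for the second coefficient spans zero, which is consistent with the nonsignificant t value obtained for that coefficient.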


Partial and semipartial correlation Within the framework of multiple regression analysis 
there are a number of other types of correlation coefficients which can be computed that can 
further clarify the nature of the relationship between predictor variables and a criterion variable. 
In evaluating the relationship between a criterion variable and a single predictor variable, it is not 
uncommon that the correlation is influenced by a third variable. As an example, the relationship 
between frequency of violent crimes (which will represent the criterion variable Y) and level of 
stress (which will represent the predictor variable $X_1$) is undoubtedly influenced by extraneous 
variables such as social class, which if included in a multiple regression analysis can represent 
a second potential predictor variable $X_2$. In some instances, by measuring the influence of a third 
variable (in this case social class), a researcher will be better able to understand the nature of the 
relationship between the other two variables (violent crime and stress). By allowing the third 
variable to serve in the role of a mediating variable, it often increases the researcher’s ability to 
predict the scores of subjects on the criterion variable. In other instances, however, the 
researcher may view the contribution of a third variable as interfering with the study of the 
relationship between the other two variables. Thus, if a researcher wants to obtain a “purer” 
measure of the relationship between violent crime and stress, he might want to eliminate the 
influence of social class from the analysis. In instances where one wants to control for the 
influence of an extraneous variable, the latter variable is viewed as a nuisance variable. 
Fortunately, correlational procedures have been developed which allow researchers to 
statistically control for the influence of extraneous variables. Two of these procedures, partial 
correlation and semipartial correlation (also referred to as part correlation), are described in 
this section. 


Test 28l: The partial correlation coefficient A partial correlation coefficient allows a researcher 
to measure the degree of association between two variables, after any linear association 
one or more additional variables have with the other two variables has been removed. Partial 
correlation reverses that which multiple correlation accomplishes. Whereas multiple correlation 
combines variables in order to assess their cumulative effect, partial correlation removes the 
effects of variables in order to determine what effect remains when one or more of the variables 
have been eliminated. 

In the case of two predictor variables $X_1$ and $X_2$ and the criterion variable Y, the partial 
correlation coefficient $r_{YX_1.X_2}$ represents the correlation between Y and $X_1$ after any linear association 
that $X_2$ has with either Y or $X_1$ has been removed. It can also be stated that the partial 
correlation coefficient $r_{YX_1.X_2}$ represents the correlation between Y and $X_1$ if $X_2$ is held constant. 
By computing the partial correlation coefficient $r_{YX_1.X_2}$ one is able to have a “purer” 
measure of the relationship between a criterion variable Y and the predictor variable $X_1$. 

When there are three variables, it is possible to compute the following partial correlation 
coefficients: $r_{YX_1.X_2}$, $r_{YX_2.X_1}$, $r_{X_1X_2.Y}$. Equation 28.72 is the general equation for computing a 
partial correlation coefficient involving three variables (where A, B, and C represent the three 
variables). The notation $r_{AB.C}$ represents the correlation between A and B, after any linear relationship 
C has with A and B has been removed. 
$$r_{AB.C} = \frac{r_{AB} - r_{AC}r_{BC}}{\sqrt{(1 - r^2_{AC})(1 - r^2_{BC})}} \qquad \text{(Equation 28.72)}$$





Employing Equation 28.72, the partial correlation coefficients $r_{YX_1.X_2}$ (which is the partial 
correlation of Y and $X_1$, with the effect of $X_2$ removed) and $r_{YX_2.X_1}$ (which is the partial 
correlation of Y and $X_2$, with the effect of $X_1$ removed) will be computed. When the appropriate 
correlations from the example involving the relationship between sugar and salt consumption and 
the number of cavities are substituted in Equation 28.72, the partial correlation coefficients 
$r_{YX_1.X_2} = .96$ and $r_{YX_2.X_1} = .60$ are computed. 











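$$r_{YX_1.X_2} = \frac{r_{YX_1} - r_{YX_2}r_{X_1X_2}}{\sqrt{(1 - r^2_{YX_2})(1 - r^2_{X_1X_2})}} = \frac{.955 - (.52)(.37)}{\sqrt{[1 - (.52)^2][1 - (.37)^2]}} = .96$$

$$r_{YX_2.X_1} = \frac{r_{YX_2} - r_{YX_1}r_{X_1X_2}}{\sqrt{(1 - r^2_{YX_1})(1 - r^2_{X_1X_2})}} = \frac{.52 - (.955)(.37)}{\sqrt{[1 - (.955)^2][1 - (.37)^2]}} = .60$$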


The value $r_{YX_1.X_2} = .96$ indicates that the correlation between sugar consumption and the 
number of cavities, with salt consumption removed, is .96. The value $r_{YX_2.X_1} = .60$ indicates 
that the correlation between salt consumption and the number of cavities, with sugar consumption 
removed, is .60. Although the partial correlation between two variables is generally (but not 
always) smaller than the zero order correlation (which is the term that refers to the correlation 
between the two variables before the effect of the third variable has been removed), this is not 
the case in the example under discussion (since the partial correlations $r_{YX_1.X_2} = .96$ and 
$r_{YX_2.X_1} = .60$ are larger than the zero order correlations $r_{YX_1} = .955$ and $r_{YX_2} = .52$). When 
a partial correlation is substantially different from the corresponding zero order correlation 
(especially when the absolute value of the partial correlation is substantially larger), it may 
indicate the presence of a suppressor variable. A suppressor variable is a predictor variable 
that can improve prediction on the criterion variable by suppressing variance that is irrelevant 
to predicting the criterion variable. In a set of three variables, a suppressor variable is a predictor 
variable ($X_i$) that has a low correlation with the criterion variable (Y), but a high correlation with 
the other predictor variable ($X_j$). By virtue of the latter, inclusion of the suppressor variable in 
the analysis may result in a multiple correlation coefficient ($R_{Y.X_iX_j}$) that has a larger absolute 
value (or even a different sign) than the zero order correlation coefficient between the criterion 
variable and the suppressor variable ($r_{YX_i}$). It should be noted that suppressor variables 
can create major problems in interpreting the results of a multiple regression analysis. For a more 
detailed discussion of suppressor variables the reader is referred to Cohen and Cohen (1983). 

The square of a partial correlation coefficient represents the proportion of variability explained 
on the criterion variable by one of the predictor variables, after removing any linear 
effects of the other predictor variable from the other two variables. In the case of $r_{YX_1.X_2} = .96$, 
$r^2_{YX_1.X_2} = (.96)^2 = .92$. Thus, 92% of the variability on the criterion variable can be accounted 
for on the basis of the predictor variable $X_1$, when the linear effects of variable $X_2$ are removed 
from the other two variables. In the case of $r_{YX_2.X_1} = .60$, $r^2_{YX_2.X_1} = (.60)^2 = .36$. Thus, 36% 
of the variability on the criterion variable can be accounted for on the basis of the predictor variable 
$X_2$, when the linear effects of variable $X_1$ are removed from the other two variables. 
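
Equation 28.72 translates directly into a small function. The Python sketch below uses hypothetical names and the three zero order correlations from the example; because it squares the unrounded coefficients, the proportions it prints can differ from those in the text by about .01.

```python
import math


def partial_r(r_ab, r_ac, r_bc):
    """Equation 28.72: correlation of A and B with C removed from both."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


r_y1, r_y2, r_12 = 0.955, 0.52, 0.37

r_y1_given_2 = partial_r(r_y1, r_y2, r_12)   # Y and X1, with X2 removed
r_y2_given_1 = partial_r(r_y2, r_y1, r_12)   # Y and X2, with X1 removed

for label, r in (("Y,X1 | X2", r_y1_given_2), ("Y,X2 | X1", r_y2_given_1)):
    print(label, round(r, 2), "proportion explained:", round(r ** 2, 2))
```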


Test 28l-a: Test of significance for a partial correlation coefficient The null hypothesis 
$H_0$: $\rho_p = 0$ can be evaluated with Equation 28.73 (where $\rho_p$ represents the population partial 
correlation coefficient). 

$$t = \frac{r_p\sqrt{n - v}}{\sqrt{1 - r_p^2}} \qquad \text{(Equation 28.73)}$$

Where: $r_p$ is the partial correlation coefficient 
v is the total number of variables employed in the analysis 


When there are two predictor variables and one criterion variable, the total number of 
variables employed in the analysis is 3, and thus $\sqrt{n - v} = \sqrt{n - 3}$. The value n - v = n - 3 
represents the number of degrees of freedom employed for the analysis. Employing Equation 
28.73, the null hypothesis is evaluated in reference to the partial correlation coefficients 
$r_{YX_1.X_2} = .96$ and $r_{YX_2.X_1} = .60$. 

$$t = \frac{(.96)\sqrt{5 - 3}}{\sqrt{1 - (.96)^2}} = 4.85 \qquad t = \frac{(.60)\sqrt{5 - 3}}{\sqrt{1 - (.60)^2}} = 1.06$$


It will be assumed that the nondirectional alternative hypothesis $H_1$: $\rho_p \neq 0$ is evaluated. 
Employing Table A2, for df = 2 (since df = 5 - 3 = 2), the tabled critical two-tailed .05 and 
.01 values are $t_{.05} = 4.30$ and $t_{.01} = 9.93$. Since the obtained value t = 4.85 is greater than 
$t_{.05} = 4.30$, the nondirectional alternative hypothesis $H_1$: $\rho_{YX_1.X_2} \neq 0$ is supported at the .05 
level (but not at the .01 level). Since the obtained value t = 1.06 is less than $t_{.05} = 4.30$, the 
nondirectional alternative hypothesis $H_1$: $\rho_{YX_2.X_1} \neq 0$ is not supported. 
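
The test statistic in Equation 28.73 is easy to verify numerically. In the Python sketch below the function name is hypothetical, the rounded coefficients .96 and .60 are used as in the text, and the critical value is the tabled one rather than a computed quantile.

```python
import math


def t_for_partial(r, n, v=3):
    """Equation 28.73: t statistic with n - v degrees of freedom."""
    return r * math.sqrt(n - v) / math.sqrt(1 - r ** 2)


T_CRIT_05 = 4.30                 # tabled two-tailed .05 value for df = 5 - 3 = 2
t_x1 = t_for_partial(0.96, n=5)
t_x2 = t_for_partial(0.60, n=5)
for t in (t_x1, t_x2):
    print(round(t, 2), abs(t) >= T_CRIT_05)
```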


When there are two predictor variables, a partial correlation can obviously only eliminate 
the effect of one other predictor variable. This kind of partial correlation is often referred to as 
a first-order partial correlation. When there are more than two predictor variables, it is possible 
to compute higher-order partial correlations in which the effects of two or more predictor 
variables are eliminated. Thus, a second-order partial correlation is one in which the effects 
of two predictor variables are eliminated. A discussion of higher-order partial correlation can 
be found in Cohen and Cohen (1983), Hays (1994), and Marascuilo and Levin (1983). 


Test 28m: The semipartial correlation coefficient A semipartial (or part) correlation 
coefficient measures the degree of association between two variables, with the linear association 
of one or more other variables removed from only one of the two variables that are being correlated 
with one another. In the case of two predictor variables and a criterion variable, a 
semipartial correlation coefficient measures the degree of association between two variables, with 
the influence of the third variable removed from only one of the two variables that are being 
correlated with one another. Thus, the semipartial correlation coefficient $r_{Y(X_1.X_2)}$ represents 
the correlation between Y and $X_1$ after any linear association that $X_2$ has with $X_1$ has been removed. 

When there are three variables, it is possible to compute the following six semipartial 
correlation coefficients: $r_{Y(X_1.X_2)}$, $r_{Y(X_2.X_1)}$, $r_{X_1(Y.X_2)}$, $r_{X_1(X_2.Y)}$, $r_{X_2(Y.X_1)}$, $r_{X_2(X_1.Y)}$. Equation 28.74 
is the general equation for computing a semipartial correlation coefficient involving three 
variables (where A, B, and C represent the three variables). The notation $r_{A(B.C)}$ represents the 
correlation between A and B, after any linear relationship that C has with B has been removed. 


$$r_{A(B.C)} = \frac{r_{AB} - r_{AC}r_{BC}}{\sqrt{1 - r^2_{BC}}} \qquad \text{(Equation 28.74)}$$


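Employing Equation 28.74, the semipartial correlation coefficients $r_{Y(X_1.X_2)} = .82$ and 
$r_{Y(X_2.X_1)} = .18$ are computed. 

$$r_{Y(X_1.X_2)} = \frac{r_{YX_1} - r_{YX_2}r_{X_1X_2}}{\sqrt{1 - r^2_{X_1X_2}}} = \frac{.955 - (.52)(.37)}{\sqrt{1 - (.37)^2}} = .82$$

$$r_{Y(X_2.X_1)} = \frac{r_{YX_2} - r_{YX_1}r_{X_1X_2}}{\sqrt{1 - r^2_{X_1X_2}}} = \frac{.52 - (.955)(.37)}{\sqrt{1 - (.37)^2}} = .18$$
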

The square of a semipartial correlation coefficient represents the proportion of variability 
explained on one of the variables by a second variable, after removing the linear effect of a third 
variable from the second variable. In the case of $r_{Y(X_1.X_2)} = .82$, $r^2_{Y(X_1.X_2)} = (.82)^2 = .67$. Thus, 
67% of the variability on Y can be accounted for on the basis of $X_1$ when the linear effect of 
$X_2$ is removed from $X_1$. In the case of $r_{Y(X_2.X_1)} = .18$, $r^2_{Y(X_2.X_1)} = (.18)^2 = .03$. Thus, only 3% 
of the variability on Y can be accounted for on the basis of $X_2$ when the linear effect of $X_1$ is 
removed from $X_2$. 

Marascuilo and Serlin (1983) note that although it is theoretically possible for the two 
values to be equal (in reference to the same variables), a semipartial correlation coefficient will 
have a smaller absolute value than a partial correlation coefficient. The latter can be confirmed 
by the fact that the partial correlation coefficient $r_{YX_1.X_2} = .96$ is larger than the semipartial 
correlation coefficient $r_{Y(X_1.X_2)} = .82$, and the partial correlation coefficient $r_{YX_2.X_1} = .60$ is 
larger than the semipartial correlation coefficient $r_{Y(X_2.X_1)} = .18$. 
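
The relationship between the two coefficients can be seen by comparing Equations 28.72 and 28.74: the numerators are identical, and the partial coefficient divides by the additional factor $\sqrt{1 - r^2_{AC}}$, which cannot exceed 1. The Python sketch below (hypothetical names, values from the example) illustrates this.

```python
import math


def semipartial_r(r_ab, r_ac, r_bc):
    """Equation 28.74: correlation of A and B with C removed from B only."""
    return (r_ab - r_ac * r_bc) / math.sqrt(1 - r_bc ** 2)


r_y1, r_y2, r_12 = 0.955, 0.52, 0.37

sr_1 = semipartial_r(r_y1, r_y2, r_12)   # Y with X1, X2 removed from X1
sr_2 = semipartial_r(r_y2, r_y1, r_12)   # Y with X2, X1 removed from X2

# Dividing by sqrt(1 - r_AC^2) converts a semipartial into the matching partial.
pr_1 = sr_1 / math.sqrt(1 - r_y2 ** 2)
pr_2 = sr_2 / math.sqrt(1 - r_y1 ** 2)
print(round(sr_1, 2), round(pr_1, 2))
print(round(sr_2, 2), round(pr_2, 2))
```

The two coefficients coincide only when the variable being removed is uncorrelated with A, in which case the extra divisor equals 1.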


Test 28m-a: Test of significance for a semipartial correlation coefficient The null hypothesis 
$H_0$: $\rho_{sp} = 0$ can be evaluated with Equation 28.75 (where $\rho_{sp}$ represents the population 
semipartial correlation coefficient). 


$$t = \frac{r_{sp}\sqrt{n - v}}{\sqrt{1 - r_{sp}^2}} \qquad \text{(Equation 28.75)}$$

Where: $r_{sp}$ is the semipartial correlation coefficient 
v is the total number of variables employed in the analysis 

When there are two predictor variables and one criterion variable, the total number of 
variables employed in the analysis is 3, and thus $\sqrt{n - v} = \sqrt{n - 3}$. The value n - v = n - 3 
represents the number of degrees of freedom employed for the analysis. Employing Equation 
28.75, the null hypothesis is evaluated in reference to the semipartial correlation coefficients 
$r_{Y(X_1.X_2)} = .82$ and $r_{Y(X_2.X_1)} = .18$. 

$$t = \frac{(.82)\sqrt{5 - 3}}{\sqrt{1 - (.82)^2}} = 2.03$$

$$t = \frac{(.18)\sqrt{5 - 3}}{\sqrt{1 - (.18)^2}} = .26$$


It will be assumed that the nondirectional alternative hypothesis $H_1$: $\rho_{sp} \neq 0$ is evaluated. 
Employing Table A2, for df = 2 (since df = 5 - 3 = 2), the tabled critical two-tailed .05 
and .01 values are $t_{.05} = 4.30$ and $t_{.01} = 9.93$. Since both of the obtained t values are less 
than $t_{.05} = 4.30$, the nondirectional alternative hypotheses $H_1$: $\rho_{Y(X_1.X_2)} \neq 0$ and 
$H_1$: $\rho_{Y(X_2.X_1)} \neq 0$ are not supported. 


Final comments on multiple regression analysis Figure 28.7 (based on Cohen and Cohen 
(1983)), which is known as a Venn diagram, provides a visual summary of the proportion of 
variance represented by zero order, multiple, partial, and semipartial correlation coefficients 
(when there are two predictor variables and a criterion variable). Each of the three circles rep- 
resents the variance of one of the three variables. Areas of overlap between circles represent 
shared variance between variables. 


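Zero order correlations: $r^2_{YX_1} = d + e$, $r^2_{YX_2} = e + f$, $r^2_{X_1X_2} = e + b$ 

Multiple correlation: $R^2_{Y.X_1X_2} = d + e + f$ 

Partial correlations: $r^2_{YX_1.X_2} = \frac{d}{d + g}$, $r^2_{YX_2.X_1} = \frac{f}{f + g}$ 

Semipartial correlations: $r^2_{Y(X_1.X_2)} = d$, $r^2_{Y(X_2.X_1)} = f$ 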
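
The area relationships listed above can be checked against the running example. The Python sketch below rests on two assumptions that are not stated explicitly in the text: that the variance of Y is scaled to 1, and that g denotes the portion of the Y circle lying outside both predictor circles, so that d + e + f + g = 1. Only the areas inside the Y circle are computed.

```python
r_y1, r_y2, r_12 = 0.955, 0.52, 0.37

# Squared semipartial correlations give the areas of Y unique to each predictor.
d = (r_y1 - r_y2 * r_12) ** 2 / (1 - r_12 ** 2)
f = (r_y2 - r_y1 * r_12) ** 2 / (1 - r_12 ** 2)
e = r_y1 ** 2 - d        # area of Y shared with both predictors
g = 1 - (d + e + f)      # area of Y explained by neither predictor (assumed meaning of g)

multiple_r2 = d + e + f
partial_r2_x1 = d / (d + g)
partial_r2_x2 = f / (f + g)
print(round(d, 3), round(e, 3), round(f, 3), round(g, 3))
print(round(multiple_r2, 3), round(partial_r2_x1, 3), round(partial_r2_x2, 3))
```

Under these assumptions the areas reproduce the coefficient of multiple determination and the squared partial correlations obtained earlier, apart from small rounding differences.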


Sources that provide comprehensive coverage of the general subject of multiple regression 
analysis (e.g., Cohen and Cohen (1983), Marascuilo and Levin (1983), Stevens (1986, 1996), 
and Tabachnick and Fidell (1989, 1996)) describe additional analytical procedures (many of 
which involve matrix algebra), as well as covering other issues that are relevant to the interpretation 
of such an analysis. It should also be noted that it is possible to conduct curvilinear 
multiple regression analysis, in which case a multiple regression equation is derived that uses a 
curvilinear combination of the predictor variables to predict scores on the criterion variable. As 
noted earlier, because of the tedious computations involved, multiple regression analysis is 
generally not practical to employ unless one has access to the appropriate computer software. 


3. Additional multivariate procedures involving correlational analysis This section of the 
Addendum will describe a number of multivariate statistical procedures that directly or indirectly 
involve correlational analysis. The descriptions in this section are nonmathematical in nature, and 
only one procedure (factor analysis) is described in detail. Readers interested in comprehensive 




Figure 28.7 Venn Diagram of Variance Components Represented 
by Squared Correlation Coefficients 


discussions of the procedures described in this section should consult sources that specialize in 
multivariate analysis. Two excellent references are Stevens (1986, 1996) and Tabachnick and 
Fidell (1989, 1996). 


Factor analysis Factor analysis is one of a number of multivariate statistical procedures 
discussed in the book. As noted under the discussion of the Hotelling’s $T^2$ test (discussed in 
Section VII of the t test for two independent samples), the most general use of the term 
multivariate is in reference to procedures that evaluate experimental designs in which there are 
multiple independent variables and/or multiple dependent variables. Factor analysis is a 
statistical technique that is commonly employed in a broad spectrum of academic disciplines, in 
order to eliminate redundancy in a large body of data. To be more specific, in factor analysis a 
set of data which is comprised of many intercorrelated variables is transformed into a format that 
consists of a limited number of variables. The latter variables, which are referred to as factors, 
represent the basic underlying dimensions that are responsible for variability in the original set 
of data. Since the goal of factor analysis is to identify the basic elements that comprise a body 
of data, in many respects it is similar to breaking matter down into its basic elements. Just as by 
virtue of combining the existing chemical elements the chemist is able to account for all varieties 
of matter, in factor analysis it is assumed that when the derived factors (which are analogous to 
the elements) are combined with one another, they will allow a researcher to account for all or 
most of the variability in a set of data. 

The description of factor analysis to be presented in this section is best categorized under 
the rubric of exploratory factor analysis (as opposed to confirmatory factor analysis, which 
is used to confirm that based on theory or preexisting empirical evidence, a set of data will 
conform to a specific factorial structure). Exploratory factor analysis employs a methodology 
referred to as principal components analysis (which is both mathematically and conceptually 
the simplest of the factor analytic procedures) to transform a body of data that is comprised of 
a large number of variables into a smaller set of variables (referred to as factors or principal 
components), which are linearly related to one another. The principal components/factors de- 
rived from the analysis should be minimal in number, yet at the same time should account for 
a large proportion of the variability in the original set of data. 


To illustrate factor analysis, a simple example will be employed in which the procedure will 
be used to study the personality structure of human beings. When factor analysis is employed 
within the latter context, it is most commonly used to identify the most elementary traits that can 
be employed to describe personality. It should be noted, however, that factor analysis can be 
more limited in scope. For instance, one might take a single trait such as anxiety, and through 
the use of factor analysis determine if anxiety can be broken down into a limited number of more 
elementary components (e.g., anxiety based on fear of death or injury versus anxiety based on 
fear of psychologically threatening stimuli). The steps involved in factor analysis will now be 
described. 

Step 1 — Accumulating the data: In conducting a factor analysis, the first thing a 
researcher must do is to select a set of measures to factor analyze. What set of measures the re- 
searcher will select will depend upon the nature of the problem one is studying. If the researcher 
wishes to identify the basic traits that can be employed to explain individual differences in 
personality, she should select a large number of measures which encompass all aspects of human 
behavior (which are a function of the personality). It is important to note that the specific types 
(as well as number) of measures a researcher selects will have a direct impact on the factors one 
will derive through use of factor analysis. The fact that two or more researchers employing 
factor analysis may reach different conclusions in studying the same subject matter can, among 
other things, be attributed to the fact that they may have employed different types of and/or 
amounts of data. 

For purposes of illustration let us assume that a researcher wants to determine whether or 
not the traits measured by six commonly used personality tests can be expressed within the 
framework of a more limited number of dimensions. The researcher elects to employ factor 
analysis, since she believes there is considerable redundancy among the six tests. In order to 
conduct the factor analysis it is necessary that the researcher administer each of the six tests to 
a large number of people. In our example it will be assumed that 1000 people are employed, and 
that each person is administered six tests which measure the following traits: Test A — Anxiety; 
Test B — Somatic Complaints; Test C — Guilt; Test D — Friendliness; Test E — Sensation 
Seeking; Test F — Dominance. 

Step 2 — Constructing the correlation matrix: Once scores are obtained for the 1000 
subjects on each of the six tests, correlations are obtained between the scores of subjects on all 
six tests. The results of such an analysis can be summarized in a correlation matrix, which is 
a table that contains the values of the correlations between three or more measures. Table 28.9 
represents the correlation matrix for our example. 


Table 28.9 Correlation Matrix (Personality Test Intercorrelations) 


Test    A     B     C     D     E     F 
A       -    .93   .86   .15   .21   .26 
B       x     -    .83   .12   .18   .22 
C       -     -     -    .05   .10   .13 
D       -     -     -     -    .62   .72 
E       -     -     -     -     -    .78 
F       -     -     -     -     -     - 


Each entry in Table 28.9 represents the value of the correlation between two of the six tests. 
To determine the two tests represented by any of the correlations in the table, identify the letter 
at the left of the row and the letter at the top of the column in which a specific correlation appears. 
The row and column letters represent the two tests for which that correlation was computed. 
Thus, as an example, the correlation between Test A and Test B is .93, since r = .93 appears in 
Row A and Column B. Note that correlations only appear in the upper half of the table, since the 
same information would be repeated in the lower half of the table (for example, the 
correlation .93 would also appear in the cell with an x, which is in Row B and Column A). Note 
that the cells in the diagonal of the table have been left blank. In actuality, a number (typically 1 or 
some other value) is employed in the diagonal cells, based upon certain defining characteristics 
of the type of factor analysis one is conducting. 
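
As a minimal sketch of Step 2, the following Python code builds a correlation matrix with NumPy. The scores are simulated random numbers, not the data behind Table 28.9, so the resulting correlations will be close to zero; the variable names are illustrative assumptions. Note that the software fills the diagonal cells with the value 1.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable
n_subjects, n_tests = 1000, 6

# SIMULATED raw scores: rows are subjects, columns are Tests A-F.
scores = rng.normal(loc=50, scale=10, size=(n_subjects, n_tests))

# rowvar=False tells NumPy that each column (not each row) is a variable.
R = np.corrcoef(scores, rowvar=False)

print(R.shape)
print(np.round(R, 2))
```

The matrix is symmetric, which is why Table 28.9 only needs to display the upper half.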

Step 3 — Conducting the factor analysis: Factor analysis (which is impractical to con- 
duct without a computer) is a complex statistical procedure that determines whether or not the 
data can be broken down into a more limited number of dimensions. The term factor derives 
from mathematics. Recall that in basic algebra you were taught to factor an equation such 
as the one below. 


$(x^2 - y^2) = (x + y)(x - y)$ 


By factoring the above equation one has broken it down into two basic elements, which, 
when combined, result in the equation. In the same respect, factor analysis of a correlation 
matrix, such as that depicted in Table 28.9, will determine whether or not a more limited number 
of basic elements/dimensions can be employed to summarize and explain the information 
provided by the six tests. At this point it will be assumed that the appropriate mathematical 
operations associated with factor analysis (which are too complex to describe here) are 
conducted. 
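
Although the full computations are not described here, the core of a principal components extraction can be sketched in a few lines of Python. The sketch below assumes that 1s are placed in the diagonal cells of Table 28.9 and that the components are left unrotated. The loadings it prints therefore need not match those reported for this example, which may reflect rotation or other options that are not stated.

```python
import numpy as np

# Correlation matrix from Table 28.9, with 1s placed in the diagonal cells.
R = np.array([
    [1.00, 0.93, 0.86, 0.15, 0.21, 0.26],
    [0.93, 1.00, 0.83, 0.12, 0.18, 0.22],
    [0.86, 0.83, 1.00, 0.05, 0.10, 0.13],
    [0.15, 0.12, 0.05, 1.00, 0.62, 0.72],
    [0.21, 0.18, 0.10, 0.62, 1.00, 0.78],
    [0.26, 0.22, 0.13, 0.72, 0.78, 1.00],
])

# eigh is intended for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]          # re-sort so the largest comes first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Unrotated loadings: each eigenvector scaled by the square root of its eigenvalue.
n_keep = 2
loadings = eigenvectors[:, :n_keep] * np.sqrt(eigenvalues[:n_keep])

print(np.round(eigenvalues, 4))
print(np.round(loadings, 2))
```

The six eigenvalues sum to the number of variables, since the trace of a correlation matrix with 1s in the diagonal equals the number of variables.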

Step 4 — Interpreting the results of the factor analysis: Table 28.10 summarizes the 
results of the factor analysis. 


Table 28.10 Summary Table of Factor Analysis of Six Personality Tests 


                                  Factor 
Test                          I          II        Communality 
A                            .98        .13        .9773 
B                            .94        .09        .8917 
C                            .88        .01        .7745 
D                            .05        .76        .5801 
E                            .11        .81        .6682 
F                            .14        .94        .9032 
Eigenvalue                  2.6526     2.1424      Sum = 4.7950 
Percent of Total Variance   44.21%     35.71% 


Table 28.10 indicates that the factor analysis yielded two factors. This means that most of 
the variance in our data (i.e., individual differences between subjects) can be accounted for by the 
two derived factors. Note that the columns for Factor I and Factor II contain a set of numerical 
values. These values are called factor loadings. A factor loading is a correlation coefficient (and 
thus it will always fall within the range +1 to -1) that tells the researcher how much each of the 
variables (in this case, each of the tests) correlates with each of the factors. As is also the case 
with a correlation coefficient, the absolute value of a factor loading indicates the strength of the 
relationship between that factor and a given variable. The higher the absolute value of a factor 
loading, the more that variable contributes to that factor, or, to put it another way, the higher the 
factor loading, the purer a measure that variable is of that factor. Thus, Test A has a loading of 
.98 on Factor I and a loading of .13 on Factor II. This indicates that Test A is measuring Factor 
I to a much greater degree than it is measuring Factor II. By squaring the factor loading of a test, 
one can determine how much of the variance on the test can be accounted for by that factor. 
Thus, $(.98)^2$ = .9604 = 96.04% of the variance on Test A can be accounted for on the basis of 
Factor I, whereas Factor II only accounts for $(.13)^2$ = .0169 = 1.69% of the variance on Test 
A. Taken together, Factors I and II account for 97.73% (i.e., 96.04% + 1.69%) of the variance 
on Test A. The value 97.73% (which expressed as a proportion = .9773) corresponds to the 
communality of the test (which is listed in the last column of Table 28.10). The concept of 
communality will be discussed in more detail later in this section. 
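
The arithmetic for Test A can be repeated for all six tests. The following Python sketch squares the loadings of Table 28.10 and sums them across the two factors. The loadings of Tests E and F on Factor I are taken as .11 and .14, the values consistent with the communalities listed in the table; the variable names are illustrative.

```python
import numpy as np

tests = ["A", "B", "C", "D", "E", "F"]
# Factor loadings from Table 28.10 (columns: Factor I, Factor II).
loadings = np.array([
    [0.98, 0.13],
    [0.94, 0.09],
    [0.88, 0.01],
    [0.05, 0.76],
    [0.11, 0.81],
    [0.14, 0.94],
])

variance_explained = loadings ** 2              # share of each test's variance, per factor
communality = variance_explained.sum(axis=1)    # row sums across the two factors

for name, (v1, v2), h2 in zip(tests, variance_explained, communality):
    print(f"Test {name}: Factor I {v1:.4f}, Factor II {v2:.4f}, communality {h2:.4f}")
```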

Since all of the variance on Test A (as well as on the other five tests) is not accounted for 
on the basis of the two factors described in Table 28.10, one might ask why only two factors have 
been employed in Table 28.10 to describe the results of the factor analysis. Or, to put it another 
way, how does one decide how many factors to derive in a factor analysis? 

As Kachigan (1986) notes, in interpreting the results of a factor analysis, a researcher must 
weigh parsimony against comprehensiveness. Thus, although the researcher wishes to account 
for as much of the variability in the data as possible (comprehensiveness), at the same time she 
wants to do it in the simplest possible manner (parsimony — i.e., with the fewest number of 
factors). There is no set rule with respect to how much of the total variance must be accounted 
for by the factors a researcher derives. (Stevens (1996) notes that some researchers attempt to 
account for a minimum of 70% of the variability.) In essence, how many factors one decides to 
employ will ultimately depend on the purpose for which one intends to use the results of a factor 
analysis. Those factors that explain the greatest amount of variability in the data almost always 
represent what are referred to as common factors. Common factors are factors that load on more 
than one of the variables. Those common factors which account for a substantial amount of the 
variability are designated as significant factors in a factor analysis. Factors I and II in Table 
28.10 represent common factors, since, on both of these factors, all six tests have loadings above 
zero, and at least three of the tests have substantial loadings on one of the two factors. In contrast 
to common factors are specific factors, which are factors that load on only one of the original 
variables that are employed in the factor analysis. Within the framework of factor analysis, 
specific factors generally account for only a small portion of the total variance, and typically do 
not play an important role in explaining the data. In addition to specific factors, error factors 
(also referred to as error variance) account for a small portion of the total variance. Error 
factors represent uncontrolled variability, i.e., such things as poor reliability 
in measuring the variables, and/or other sources of error in the data that are beyond the control 
of the researcher. 

Returning to our problem, the exact amount of the variance that can be accounted for by the 
two factors represented in Table 28.10 can be obtained by adding up the numbers in the last row 
(labelled Percent of Total Variance) of the table. Thus: 44.21% + 35.71% = 79.92%. This tells 
us that 79.92% of the total variability in the data can be explained by the two factors, and that 
of that 79.92%, Factor I accounts for 44.21% of the variability, while Factor II accounts for the 
remaining 35.71% of the variability. In a factor analysis summary table, the first factor listed 
(identified as Factor I) will always account for the greatest amount of variability, followed by 
Factor II which will account for the next largest amount of variability, and so forth. If the results 
of the factor analysis were depicted in greater detail, and those additional factors that accounted 
for minimal variability were included in Table 28.10, 100% of the variability would be accounted 
for. In any event, a factor analysis based on six initial variables in which two factors account for 
79.92% of the variance, would be considered an excellent compromise between parsimony and 
comprehensiveness — i.e., one that explains most of the variability through use of a minimum 
number of factors. 

At this point in the discussion, two other sets of values depicted in Table 28.10 will be 
explained — specifically, eigenvalues and communalities. An eigenvalue is a numerical index 
that indicates the relative strength of each of the derived factors. On a more technical level, 
Kachigan (1986) notes that an eigenvalue (also known as a latent root) is the equivalent number 
of variables a factor represents. As an example, a factor with an eigenvalue of 4 accounts for as 
much variance in the overall data as one would expect for four variables, if the total variability 
were evenly distributed among all of the variables. Thus the higher the eigenvalue associated 
with a factor, the larger the role that factor plays in explaining variability in the complete set of 
data. The value of an eigenvalue can range from any number above zero up to the number of 
variables being factor analyzed (which in our example is six). In order to employ an eigenvalue 
to determine the relative strength of a factor (in terms of percentage of variability that factor 
accounts for) one should do the following: a) Divide the value of the eigenvalue by the number 
of variables employed in the factor analysis; and b) Multiply the result of the division by 100. 
The resulting value will be the percentage of variability in the data that can be accounted for by 
that factor. 

Thus, in our example: For Factor I: 2.6526 ÷ 6 = .4421, and .4421 × 100 = 44.21%; For 
Factor II: 2.1424 ÷ 6 = .3571, and .3571 × 100 = 35.71%. Note that the values 44.21% and 
35.71% correspond to the values in the bottom row (labelled Percent of Total Variance) of 
Table 28.10. A common rule employed by some researchers in factor analysis is to only employ 
factors that have an eigenvalue of 1 or greater (i.e., factors that at least account for the same 
amount of variance as one variable would be expected to). As a general rule, factors that have 
eigenvalues less than one will represent specific factors or error factors whose contribution in 
explaining the overall variability in the data is minimal. 
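
The two-step conversion and the eigenvalue-of-1 rule can be expressed as a short Python sketch. The eigenvalues are those of Table 28.10; the dictionary and variable names are illustrative.

```python
eigenvalues = {"Factor I": 2.6526, "Factor II": 2.1424}   # from Table 28.10
n_variables = 6

# Steps a) and b): divide by the number of variables, then multiply by 100.
percent_of_variance = {
    name: value / n_variables * 100 for name, value in eigenvalues.items()
}

# Rule of thumb described in the text: keep factors whose eigenvalue is 1 or greater.
retained = [name for name, value in eigenvalues.items() if value >= 1]

for name, pct in percent_of_variance.items():
    print(f"{name}: {pct:.2f}% of total variance")
print("Retained:", retained)
```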

The amount of the variance on any variable that can be explained by the derived common 
factors is referred to as communality. The communality of a variable is derived by squaring the 
factor loadings of the variable on each of the factors, and summing the squares. The sum of all 
the squared factor loadings on a variable represents the communality of that variable. Commun- 
ality values, which (as previously noted) are listed in the last column of Table 28.10, will always 
fall within the range 0 to 1. The communality for Test A is .9773 (97.73% when expressed as a 
percentage), since as noted previously: $(.98)^2 + (.13)^2$ = .9773. Thus, 97.73% of the variance 
on Test A can be explained by Factors I and II. Of the 100% total variance on Test A, only 
2.27% (i.e., 100% - 97.73% = 2.27%) cannot be accounted for by either Factor I or Factor II. 
The remaining 2.27% of the variance may be explainable through other factors not depicted in 
Table 28.10. 

The following additional points with respect to the values depicted in Table 28.10 should 
be noted: a) The sum of the communalities is equal to the sum of the eigenvalues. This will 
always be the case in a factor analysis summary table that has the same basic structure as Table 
28.10; b) If, in fact, all possible factors in a set of data, including specific factors and error factors 
are derived, and the squared factor loadings for each variable are summed, the communality of 
each variable will equal 1. In such a case the sum of the communalities of all the variables will 
always equal the total number of variables employed in the factor analysis; c) It is also the case 
that if all possible factors in a set of data are derived, the sum of the eigenvalues of all of the 
factors will equal the total number of variables; and d) For any given variable, the total variance 
on that variable can be attributed to the following: 1) The common variance between the variable 
and all the factors derived in the factor analysis for which the variable has a factor loading other 
than zero; 2) Error variance, which is obtained by subtracting the reliability coefficient of the 
variable from the value 1; and 3) Variance specific to the variable, which is computed by 
subtracting from the value 1, the total common variance (i.e., communality) of the variable on 
all common factors, and the error variance. The value that remains after this subtraction 
represents the specific variance unique to that variable. This specific variance is independent of 
all other variables. 
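
Points a) and d) can be checked numerically. In the Python sketch below, summing the squared loadings of Table 28.10 by row gives the communalities, and summing them by column reproduces the eigenvalues listed in the table, so the two totals must agree. The reliability coefficient of .99 for Test A is a hypothetical value chosen only to illustrate the subtraction in point d); it does not come from the example.

```python
import numpy as np

# Factor loadings from Table 28.10 (columns: Factor I, Factor II).
loadings = np.array([
    [0.98, 0.13],
    [0.94, 0.09],
    [0.88, 0.01],
    [0.05, 0.76],
    [0.11, 0.81],
    [0.14, 0.94],
])
squared = loadings ** 2

communalities = squared.sum(axis=1)   # row sums: one value per test
eigenvalues = squared.sum(axis=0)     # column sums: one value per factor
print(round(communalities.sum(), 4), round(eigenvalues.sum(), 4))

# Point d) for Test A, using a HYPOTHETICAL reliability coefficient.
reliability_A = 0.99
common = communalities[0]             # variance shared with the derived factors
error = 1 - reliability_A             # error variance
specific = 1 - common - error         # what remains is specific to Test A
print(f"common {common:.4f}, error {error:.4f}, specific {specific:.4f}")
```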

Step 5 — Naming the factors: At the conclusion of a factor analysis, a researcher will 
generally assign a name to each of the factors. This is done by carefully examining the content 
of the variables (in this instance tests) that load high on a given factor. In our example, since 
Factor I is essentially comprised of tests that measure Anxiety (Test A), Somatic Complaints (Test 
B), and Guilt (Test C), one might elect to label Factor I Neuroticism. This might be the case, 
since many mental health professionals would include the behavior/traits measured by Tests A, 
B, and C as characteristic of a neurotic individual. Thus, someone who has high scores on these 
tests would be likely to be viewed as high on neuroticism, while a person with low scores on 
these tests would be viewed as low on neuroticism. 

In the same respect, since Factor II is comprised of tests that measure Friendliness (Test 
D), Sensation Seeking (Test E), and Dominance (Test F), one might elect to label Factor II 
Extroversion. This would be based on the premise that the behaviors/traits measured by Tests 
D, E, and F would be viewed by many psychologists as underlying components of the more 
general trait of extroversion. 

Of course, one might challenge the above labels. For instance, one might, among other 
things, prefer to call Factor I Mental Health and Factor II Energy Level. The point to be 
made is that the naming of factors is based upon a subjective decision made by the researcher. 
It is conceivable, and not all that uncommon, that a name assigned to one or more factors by a 
researcher may be challenged by another researcher with respect to its appropriateness. 
Typically, in selecting a name for a factor, the larger the loading of a specific variable on that 
factor, the greater the role it should play in determining its name. 

Additional comments on factor analysis A score, referred to as a factor score, can be 
derived for any subject on a given factor. A subject's factor score will be a composite score 
based on the relative contribution of all of the variables that represent that factor. Thus, a 
subject's score on each variable (i.e., personality test in the example under discussion) is 
weighted accordingly with respect to the degree that it measures that factor. 

Since the sign of a factor loading is interpreted the same way as the sign of a correlation 
coefficient, the sign of a factor loading indicates the direction of the relationship between a 
subject's score on a variable and his or her score on that factor. Specifically, in the case of 
Factor I, the factor loading of .98 for Test A indicates that a subject who obtains a high score on 
Test A will have a high score on Factor I, and that a subject who has a low score on Test A will 
have a low score on Factor I. Just as positive factor loadings are interpreted as positive 
correlations, negative factor loadings are interpreted as negative correlations. Thus, if the factor 
loading for Test A was -.98, a subject who has a high score on Test A will have a low score on 
Factor I, and a subject who has a low score on Test A will have a high score on Factor I. If a 
factor loading for any variable is negative, the latter must be taken into account in determining 
a subject's score for that factor. 
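
The role of the weights and their signs can be illustrated with a deliberately coarse Python sketch. It computes a loading-weighted average of one subject's standardized scores. This is an assumption made for illustration only: the text does not specify how factor scores are computed, and statistical software typically uses more refined methods. The subject's z scores are hypothetical.

```python
import numpy as np

# Factor I loadings for Tests A-F, from Table 28.10.
loadings_factor_I = np.array([0.98, 0.94, 0.88, 0.05, 0.11, 0.14])
# HYPOTHETICAL standardized (z) scores of one subject on Tests A-F.
z_scores = np.array([1.5, 1.2, 0.9, -0.3, 0.1, 0.0])

def weighted_composite(z, loadings):
    """Loading-weighted average of standardized scores (a coarse factor score)."""
    return float(np.dot(z, loadings) / np.abs(loadings).sum())

score = weighted_composite(z_scores, loadings_factor_I)

# If Test A loaded -.98 instead, a high score on Test A would pull the composite down.
flipped = loadings_factor_I.copy()
flipped[0] = -0.98
score_flipped = weighted_composite(z_scores, flipped)

print(round(score, 3), round(score_flipped, 3))
```

Tests with loadings near zero (here Tests D, E, and F) contribute almost nothing to the composite, which is the sense in which each variable is weighted by the degree to which it measures the factor.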

Diekhoff (1992), among others, notes that the issue of statistical significance in reference 
to factor analysis addresses the question of whether or not the obtained factor structure for a set 
of data is a reliable indicator of the factor structure in the underlying population. To be more 
specific, if it is determined that the factor structure is statistically significant, it would be expected 
that the same factor structure will be obtained if a factor analysis is conducted on another sample 
that is drawn from the same population from which the original sample was derived. Although 
a number of tests of statistical significance (all of which are mathematically complex and assume 
a relatively large sample size) have been developed for factor analysis, there is a lack of agree- 
ment among sources with respect to which test is most appropriate to employ. Some sources, 
however, use the following guidelines for determining whether or not a specific factor loading 
is statistically significant: a) For smaller sample sizes, any factor loading with an absolute 
value of .40 or greater is considered significant; and b) For larger sample sizes, any factor load- 
ing with an absolute value of .30 or greater is considered significant. (The reader should note 
that the amount of explainable variance on a variable attributable to a factor which has a loading 
of .30 is only $(.30)^2$ = .09 = 9%, certainly not a large amount by anyone's standards.) 
Sources, however, do not agree on the values that constitute a "smaller" versus "larger" sample 
size. 

If one employs the criterion of .30 for determining significance, Table 28.10 reveals that 
Tests A, B, and C have significant loadings on Factor I, but not on Factor II, while Tests D, E, 
and F have significant loadings on Factor II but not on Factor I. This same information is, in 
part, revealed in Table 28.9 by virtue of the following: a) Tests A, B, and C are highly correlated 
with one another; and b) Tests D, E, and F are highly correlated with one another. Although in 
our example the pattern of intercorrelations in Table 28.9 suggests the underlying factorial 
structure of the data (i.e., that there are two primary factors, with one factor comprised of Tests 
A, B, and C, and the other factor comprised of Tests D, E, and F), such information will not 
always be obvious through inspection of a correlation matrix. To illustrate this point, if instead 
of six tests, the researcher had started out with 60 tests, it would be cumbersome (to say the 
least), and very likely impossible to discern the underlying factorial structure of the data by visual 
inspection of the correlation matrix. In any event, even if one is able to obtain a general picture 
of the factorial structure for a set of data, the information provided in Table 28.10 provides more 
precise information than does the correlation matrix. 
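
Applying the .30 criterion to Table 28.10 is a simple comparison of absolute values, sketched below in Python. The criterion is the guideline quoted above, not a formal significance test, and the names used are illustrative.

```python
import numpy as np

tests = ["A", "B", "C", "D", "E", "F"]
# Factor loadings from Table 28.10 (columns: Factor I, Factor II).
loadings = np.array([
    [0.98, 0.13],
    [0.94, 0.09],
    [0.88, 0.01],
    [0.05, 0.76],
    [0.11, 0.81],
    [0.14, 0.94],
])

criterion = 0.30
# The absolute value is used because the sign only indicates direction.
salient = np.abs(loadings) >= criterion

for name, (on_I, on_II) in zip(tests, salient):
    factors = [label for label, flag in (("I", on_I), ("II", on_II)) if flag]
    print(f"Test {name}: significant loading on Factor {', '.join(factors) or 'none'}")
```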

In actuality there are a number of different potential solutions that might result from the 
factor analysis of a set of data. A procedure referred to as rotation is regularly employed in 
factor analysis in order to arrive at what one considers to be the most useful solution. Since the 
factors that are derived in a factor analysis can be represented geometrically as well as mathe- 
matically, rotation is a procedure which involves rotating geometrical axes that serve as reference 
points for identifying the factors. It is up to the researcher to choose the degree of rotation which 
she believes will provide the best solution for the data. 
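
For a two-factor solution, rotating the axes amounts to multiplying the loading matrix by a 2 × 2 rotation matrix. The Python sketch below rotates the loadings of Table 28.10 by 30 degrees, an arbitrary angle chosen only for illustration. The individual loadings change, but the communality of each test does not, which is why rotation alters the interpretation of the factors without altering how much variance they jointly explain.

```python
import numpy as np

# Factor loadings from Table 28.10 (columns: Factor I, Factor II).
loadings = np.array([
    [0.98, 0.13],
    [0.94, 0.09],
    [0.88, 0.01],
    [0.05, 0.76],
    [0.11, 0.81],
    [0.14, 0.94],
])

def rotate(loadings, degrees):
    """Rotate a two-factor loading matrix by the given angle (the axes stay at right angles)."""
    theta = np.radians(degrees)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    return loadings @ rotation

rotated = rotate(loadings, 30)   # 30 degrees is an arbitrary, illustrative choice

communalities_before = (loadings ** 2).sum(axis=1)
communalities_after = (rotated ** 2).sum(axis=1)

print(np.round(rotated, 2))
print(np.allclose(communalities_before, communalities_after))
```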

Perhaps the most common practice in rotation is to configure the data so that each of the 
variables has a high loading on as few of the derived factors as possible (which, in fact, is the 
case in our example, since each test has a significant loading on only one of the factors). This 
type of rotation (which is employed within the framework of a procedure called the simple 
structure) is done in order to ensure that one can come up with the most unambiguous and direct 
interpretation for each of the factors. To be more precise, Tabachnick and Fidell (1989, p. 637; 
1996, p. 675) note that "if simple structure is present (and factors are not too highly correlated 
with one another), several variables correlate highly with each factor and only one factor 
correlates highly with each variable." In an example such as ours in which each test only loads 
significantly on one factor, it is entirely possible that prior to rotation one or more of the tests had 
significant loadings on both Factor I and Factor II. It may have only been through the use of 
rotation that the data were reconfigured so that each test had a significant loading on just 
one of the factors. A variable which has a high loading on only one factor (and thus, for the most 
part, only measures that factor) is called a factorially simple variable. On the other hand, a 
variable which has a high loading on more than one factor is referred to as a factorially complex 
variable. 

Within the framework of factor analytic procedures, there are a variety of options one may 
employ with respect to the rotation of factors. Aside from the degree of rotation one employs, 
the most notable options involving rotation revolve around the issue of whether one wishes to 
derive orthogonal versus oblique factors. Orthogonal factors are independent factors — i.e., 
factors that have zero correlation with one another. Oblique factors, on the other hand, are 
factors which are not independent of one another — i.e., factors which are correlated with one 
another. Oblique factors can themselves be treated as variables, and factor analyzed into lower 
order factors. As a general rule, most factor analyses that are conducted involve the derivation 
of orthogonal rather than oblique factors. 

The major criticisms directed toward factor analysis revolve around the subjective aspects 
of the procedure. As already noted, the type and degree of rotation employed by the researcher, 
as well as the names one assigns to the derived factors, are based on subjective decisions. The 
criticism with respect to rotation is germane to the more general criticism that, within the 
framework of factor analysis, there are multiple procedures from which a researcher can choose, 
and the different procedures may not yield identical or even similar results. It is also important 
to realize that factor analysis will only provide useful information if it is employed with 
appropriate data. The variables a researcher initially elects to employ within the framework of 
a factor analysis must be carefully thought out in relation to the problem under study. If, due to 
prejudice or ignorance, a researcher ignores certain variables, the factor analysis will be unable 
to take such variables into account in describing the factorial structure of whatever it is that is 
ostensibly being studied. As Kachigan (1986, p. 400) notes, "Factor analysis does not create new 
information. It merely organizes, summarizes, and quantifies information that is fed into the 
system." In spite of the above noted criticisms, factor analysis is commonly employed by many 
researchers in multiple academic disciplines. Most researchers who have familiarity and 
experience with factor analysis would agree that, when used judiciously, it can be a powerful tool 
for evaluating a large body of data. 


Canonical correlation Canonical correlation is a statistical procedure that correlates two 
sets of variables with one another. One set, which is comprised of two or more X variables, 
represents the predictor variables, while the other set, which is comprised of two or more Y 
variables, represents the criterion variables. Note that in contrast to multiple regression, which 
involves multiple predictor (X) variables and a single criterion (Y) variable, in canonical 
correlation there are multiple variables in both sets. The goal in canonical correlation is to 
identify pairs of linear combinations involving the two sets of variables that yield the highest 
correlation with one another. The term canonical variate is employed to identify any linear 
combination comprised of X (or Y) variables that is correlated with a linear combination of Y 
(or X) variables. The procedure in canonical correlation searches for the set of canonical variates 
that yields the maximum correlation coefficient. The next set of canonical variates (uncorrelated 
with the first) is then identified which yields the next highest correlation, and so on. Kachigan 
(1986) notes that canonical correlation is most likely to be useful in situations where there is 
doubt that a single variable in and of itself can serve as a suitable criterion variable. Conse- 
quently, by determining if a set of criterion variables correlate with a set of predictor variables, 
a clearer picture may emerge regarding the relationship between the dimensions represented by 
the X and Y variables. Tabachnick and Fidell (1996) note, however, that sometimes the statistical 
solution that results from a canonical analysis may prove difficult to interpret. 

To illustrate canonical correlation, consider the following example. Let us assume a 
researcher has the following five lifestyle measures on a sample of subjects, which will represent 
the predictor (X) variables: Number of hours of exercise per week, number of grams of fat con- 
sumed per week, number of milligrams of caffeine consumed per week, number of grams of 
sugar consumed per week, and scores on a test assessing daily stress. In addition, the researcher 
has the following three scores as indices of health, which will represent the criterion (Y) 
variables: Diastolic blood pressure, body-fat ratio, and composite blood chemistry index of 
physical health. Canonical correlation can be employed to determine if there are reliable ways 
in which measures within the two sets of variables are related to one another. Thus, for example, 
one might find that a canonical variate comprised of two of the predictor variables (e.g., number 
of milligrams of caffeine consumed per week and number of grams of sugar consumed per week) 
is highly correlated with a canonical variate comprised of two criterion variables (e.g., diastolic 
blood pressure and composite blood chemistry index of physical health). 
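
One way to compute canonical correlations, sketched below in Python, is to take the singular values of the product of orthonormal bases for the two centered sets of variables. The data are simulated (five X variables and three Y variables, mirroring the example), and the dependence of the first Y variable on two of the X variables is built in by assumption. The sketch also assumes that no variable within a set is an exact linear combination of the others.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X and the columns of Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered data.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx' Qy are the canonical correlations, largest first.
    return np.clip(np.linalg.svd(Qx.T @ Qy, compute_uv=False), 0.0, 1.0)

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 5))      # five SIMULATED lifestyle measures
Y = rng.normal(size=(n, 3))      # three SIMULATED health indices
# Built-in relationship: the first Y variable depends on the third and fourth X variables.
Y[:, 0] += 1.5 * X[:, 2] + 1.0 * X[:, 3]

r = canonical_correlations(X, Y)
print(np.round(r, 3))
```

The number of canonical correlations equals the number of variables in the smaller set (here three), and they are returned from largest to smallest, matching the order in which successive pairs of canonical variates are identified.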

Like most multivariate procedures, the mathematics involved in conducting canonical 
correlation are quite complex, and for this reason it becomes laborious if not impractical to 
implement without the aid of a computer. Since a full description of canonical correlation is 
beyond the scope of this book, the interested reader should consult sources such as Stevens (1986, 
1996) and Tabachnick and Fidell (1989, 1996) which describe multivariate procedures in detail. 


Discriminant analysis and logistic regression Discriminant analysis is a multivariate sta- 
tistical procedure that derives equations which are designed to predict group membership (which 
represents the dependent/criterion variable) employing a set of quantitative measures (i.e., 
variables measured on an interval/ratio scale), which represent independent/predictor variables. 
The criterion variable in discriminant analysis is a discrete/qualitative variable that is comprised 
of two or more categories (e.g., breast cancer survivors versus breast cancer fatalities; religious 
affiliation, ethnic category, etc.). The equations derived in the analysis (which are called dis- 
criminant functions) are similar to those in regression analysis, in that in both procedures the 
equations are linear combinations of predictors that are correlated with a dependent variable. 

Tabachnick and Fidell (1996) note that discriminant analysis addresses the same 
questions that are evaluated with the multivariate analysis of variance (MANOVA) (discussed 
in Section VII of the single-factor between-subjects analysis of variance). However, in the 
latter procedure, group membership serves as the independent variable, and the multiple 
quantitative measures represent the dependent variables. If within the context of MANOVA a 
significant difference is found between the groups, it reflects the fact that the dependent variables 
are reliable predictors of group membership. 

To illustrate the application of discriminant analysis, assume a researcher wants to cate- 
gorize people into one of two groups (which will represent the dependent/criterion variable) — 
those who have had a silent heart attack and those who have not. The categorization with respect 
to group would be based on equations that employ subjects’ scores on the following four 
independent/predictor variables: cholesterol level, diastolic blood pressure, body fat ratio, and 
age. The equations derived in the discriminant analysis (i.e., the discriminant functions) will 
be linear combinations of the predictors that are correlated with the criterion variable. 

An alternative to discriminant analysis for predicting group membership is logistic regression. 
As is the case with discriminant analysis, in logistic regression a discrete/qualitative 
criterion variable is employed. However, in logistic regression the independent/predictor 
variables can be discrete/categorical, continuous, or a combination of both. In addition, logistic 
regression is more flexible than discriminant analysis, since unlike the latter its reliability does 
not depend on certain restrictive normality assumptions regarding the underlying population 
distributions for the predictor variables. Like most multivariate procedures, the mathematics 
involved in conducting both discriminant analysis and logistic regression are quite complex, 
and for this reason they become laborious if not impractical to implement without the aid of a 
computer. Since a full description of the aforementioned procedures is beyond the scope of this 
book, the interested reader should consult sources such as Stevens (1996) and Tabachnick and 
Fidell (1996) that describe both procedures in detail. Kachigan (1986) provides an easily under- 
standable nonmathematical discussion of discriminant analysis. 
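
As a rough illustration of how logistic regression predicts group membership, the following Python sketch fits a one-predictor logistic model by simple gradient ascent. The data, the function names (fit_logistic, predict_prob), the learning rate, and the number of steps are all hypothetical choices made for illustration; a researcher would ordinarily rely on a statistical package, and the sketch is not a substitute for the procedures described in the sources cited above.

```python
import math

def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1 * (x - mean_x)))) by gradient ascent.

    Hypothetical helper: lr and steps are arbitrary illustrative settings.
    """
    n = len(x)
    mean_x = sum(x) / n
    xc = [xi - mean_x for xi in x]  # center the predictor so the steps are stable
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        # predicted probabilities under the current coefficients
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in xc]
        # gradient of the mean log-likelihood with respect to b0 and b1
        g0 = sum(yi - pi for yi, pi in zip(y, p)) / n
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, xc)) / n
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1, mean_x

def predict_prob(model, x_new):
    """Predicted probability of membership in the group coded 1."""
    b0, b1, mean_x = model
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * (x_new - mean_x))))

# Hypothetical data: a single risk score and group membership (1 = event occurred).
risk = [1, 2, 3, 4, 5, 6, 7, 8]
group = [0, 0, 0, 1, 0, 1, 1, 1]
model = fit_logistic(risk, group)
print(predict_prob(model, 2), predict_prob(model, 7))
```

A subject would be assigned to the group coded 1 when the predicted probability exceeds some cutoff (commonly .50). Note that the predictor here could just as easily be a 0/1 categorical variable, which reflects the flexibility of logistic regression noted above.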


4. Meta-analysis and related topics Meta-analysis is a methodology for pooling the results 
of multiple studies that evaluate the same general hypothesis. The purpose of meta-analysis is 
to allow a research community to come to some conclusion with respect to the validity of a 
hypothesis that is not based on one or two studies, but rather is based on a multitude of studies 
which have addressed the same general hypothesis. Hedges and Olkin (1985) note that R. A. 
Fisher (within the framework of analyzing agricultural research) and Karl Pearson were 
among those who developed the first meta-analytic procedures in the 1930s. However, the work 
of Glass and his associates (Glass (1976, 1977) and Glass et al. (1981)) is largely responsible for 
popularizing the use of meta-analysis within the scientific community. More recently, Robert 
Rosenthal and Donald Rubin have contributed to the development of many of the analytical 
techniques that are presently employed within the framework of meta-analysis. 

Pooling the results of multiple studies that evaluate the same hypothesis is certainly not a 
simple and straightforward matter. More often than not there are differences between two or more 
studies which address the same general hypothesis. Rarely if ever are two studies identical with 
respect to the details of their methodology, the quality of their design, the soundness of execution, 
and the target populations that are evaluated. To further complicate matters, there is the issue of 
additional studies that may have evaluated the same hypothesis which were either never submitted 
for publication or were submitted but rejected. Rosenthal (1979, 1991, 1993) refers to this latter 
phenomenon (which will be discussed in detail later) as the file-drawer problem. In spite of the 
practical and theoretical difficulties involved in pooling the results of multiple studies, during the 
past 20 years numerous analytical procedures have been developed for this purpose. 

Hedges and Olkin (1985) and Rosenthal (1991, 1993) note that two general approaches 
characterize meta-analytic research. One approach involves procedures that evaluate statistical 
significance for the combined results of multiple studies, while the second approach estimates 
treatment effects across studies. Hedges and Olkin (1985), Rosenthal (1991, 1993), and Wolf 
(1986) note that one method for evaluating the statistical significance of the combined results of 
multiple studies is the vote-counting method. The latter procedure involves identifying all of 
the studies that are believed to evaluate the same general hypothesis, and then determining the 
number of studies that yield a statistically significant result. The proportion of significant studies 
is then contrasted with the proportion of studies that are not significant (through use of a pro- 
cedure such as the binomial sign test for a single sample (Test 9)). A variant of the vote- 
counting method statistically combines the probability values from two or more studies in order 
to compute a pooled probability value. Hedges and Olkin (1985) state that in spite of the 
intuitive appeal of such a simple and straightforward approach, the vote-counting procedure 
tends to be strongly biased toward the conclusion that there is no overall treatment effect for the 
variables under study. The latter bias is largely attributed to the relatively low power (due to the 
use of small sample sizes employed in studies in which a small to moderate effect size may be 
present) of research in certain disciplines such as the social and behavioral sciences. If one 
assumes low power, the vote-counting method will most likely only include as significant those 
studies in which the effect size is large, and fail to include studies where a weak or moderate 
effect size is present. Thus, the advantage of meta-analytic techniques which ignore the level of 
significance, but instead pool effect sizes, is that they circumvent the problem of low statistical 
power. Hedges and Olkin (1985) note that an optimal meta-analytic strategy should allow an 
investigator to compute an average treatment effect across all of the studies, as well as the 
consistency of the treatment effect across the studies. The general subject of treatment effects 
has been discussed throughout this book, often within the framework of the discussion of power. 
Specifically, various indices of treatment effect and measures of association (which are 
commonly employed as measures of treatment effect) are discussed in detail in Section VI of the 
following tests: The single-sample t test, the t tests for two independent and dependent 
samples, the chi-square test for r x c tables, and the tests that involve the analysis of variance 
procedures. Prior to describing a number of meta-analytic procedures, the subject of effect size 
will be discussed in greater detail. 
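
The vote-counting method described above can be sketched as follows. The code computes an exact one-tailed binomial probability for the number of significant studies. The function name sign_test_p, the null value of .5 for the proportion of significant studies, and the counts in the example are assumptions made for illustration only.

```python
from math import comb

def sign_test_p(n_significant, n_total, pi=0.5):
    """Exact one-tailed binomial probability of n_significant or more
    significant studies out of n_total, if the true proportion is pi."""
    return sum(
        comb(n_total, k) * pi**k * (1 - pi) ** (n_total - k)
        for k in range(n_significant, n_total + 1)
    )

# Hypothetical tally: 8 of 10 studies yield a significant result.
p_value = sign_test_p(8, 10)
print(round(p_value, 4))  # 0.0547
```

Because the tally records only whether each study reached significance, a set of low-powered studies with a genuine small effect can easily produce few "votes," which is the bias Hedges and Olkin (1985) describe.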


Measures of effect size The discussion will begin by clarifying the relationship between sta- 
tistical significance and effect size. Equation 28.76 is a general equation (presented by Rosenthal 
(1991, 1993) and discussed in Tatsuoka (1993)) which describes the relationship between effect 
size and a test statistic employed to measure statistical significance. 


Effect size = Significance test statistic / Sample size    (Equation 28.76) 


The Effect size (ES) value on the left side of Equation 28.76 can be any one of various 
measures of effect size discussed throughout this book. The value designated Significance test 
Statistic will be the computed value for the inferential test statistic that is employed to determine 
statistical significance (e.g., a t, F, χ², etc. value). The number employed to represent the 
Sample size will be some index that reflects the overall size of a sample employed in a study (but 
will usually not correspond exactly to the total number of subjects employed in a study). The 
relationship in Equation 28.76 reflects the fact that if sample size varies, in order for the effect 
size to remain unchanged, there must be a direct relationship between the magnitude of the 
computed test statistic and the sample size (i.e., as the value of the sample size increases, the 
magnitude of the test statistic must increase). This relationship was demonstrated earlier in ref- 
erence to Example 16.1, which is employed to illustrate the chi-square test for r x c tables. In 
Section II of the latter test, Table 16.2 summarizes the data for Example 16.1. Analysis of the 
data in the latter table, which is comprised of 200 observations, yields a chi-square value of 
χ² = 18.18. In Section VI of the chi-square test for r x c tables (under measures of 
association for r x c contingency tables), Table 16.22 summarizes the same experiment em- 
ploying numbers (in the rows, columns, and cells of the summary table) that are half the value 
of those employed in Table 16.2. The number of observations in Table 16.22 is 100, and the 
computed test statistic is χ² = 9.1. Since the identical degrees of freedom are employed in 
the analysis of both tables (i.e., df = 1, with $\chi^2_{.05} = 3.84$ and $\chi^2_{.01} = 6.63$), the level of 
significance represented by the p value obtained for Table 16.2 will be much lower than the p value 
obtained for Table 16.22. When Equation 28.76 is employed to compute the effect size index 
(ES) for Tables 16.2 and 16.22, the following values are obtained: ES = 18.18/200 = .091 (for 
Table 16.2) and ES = 9.1/100 = .091 (for Table 16.22). The aforementioned effect size values 
correspond to the square of the values that will be computed for the phi coefficient (φ) 
(computed with Equation 16.18) for each of the tables. Note that the chi-square values computed for 
the two tables are proportional to the sample sizes — in other words, the chi-square value 
computed for Table 16.2 is two times the chi-square value computed for Table 16.22, and the 
sample size employed in Table 16.2 is two times the size of the sample employed in Table 16.22. 
Yet, in spite of the latter, the identical effect size is computed for both tables. The fact that the 
two effect sizes are equal illustrates that unlike the computed value of a test statistic and its 
associated probability value, the effect size is independent of sample size. Thus, a computed 
test statistic (e.g., a t, F, χ², etc. value) does not in itself provide information regarding the 
magnitude of a treatment effect. The reason for this is that the value of a test statistic is not only 
a function of the treatment effect, but is also a function of the size of the sample employed in an 
experiment. Since the power of a statistical test is directly related to sample size, the larger the 
sample the more likely a significant value will be obtained for the test statistic if there is an effect 
of any magnitude present in the underlying populations. Regardless of how small a treatment 
effect is present, the magnitude of the computed test statistic will increase as the size of the 
sample employed to detect it increases. Thus, it is impossible to determine from a significant test 
statistic and its associated p value (e.g., .05, .01, .001, etc.) whether the significant result is 
due to a large, medium, or small treatment effect. 
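
The arithmetic for Tables 16.2 and 16.22 can be checked with a few lines of Python. The function name effect_size is hypothetical; the chi-square values and sample sizes are those reported above.

```python
import math

def effect_size(test_statistic, sample_size):
    """Equation 28.76: effect size = test statistic / sample size."""
    return test_statistic / sample_size

es_table_16_2 = effect_size(18.18, 200)   # full-size table
es_table_16_22 = effect_size(9.1, 100)    # same pattern, half the observations

print(round(es_table_16_2, 3), round(es_table_16_22, 3))  # 0.091 0.091

# The square root of each value corresponds to the phi coefficient for that table.
phi_16_2 = math.sqrt(es_table_16_2)
phi_16_22 = math.sqrt(es_table_16_22)
```

Halving the sample halves the chi-square value but leaves the ratio, and therefore the effect size, unchanged.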

The wisdom of using the conventional hypothesis testing model (which employs the result 
of a test of statistical significance) is addressed in detail at the end of the discussion of meta- 
analysis. For some time there has been controversy regarding the wisdom of employing tests of 
significance, insofar as the results of such tests are a function of power, which as noted previously 
is a function of sample size. The material to be presented on this issue later will describe an 
alternative hypothesis testing model which some researchers believe should be employed in lieu 
of the conventional model. As you will see, regardless of which hypothesis testing model a 
researcher employs, the key to effective hypothesis testing ultimately boils down to using a 
representative sample that is large enough to detect any meaningful effect(s) present in the 
underlying population(s). 

At this point a summary of the various indices that are employed to measure effect size will 
be presented. Throughout the discussion to follow, the terms effect size and treatment effect 
will be used interchangeably, since all of the measures described below are variously referred to 
as measures of effect size, measures of magnitude of treatment effect, measures of 
association, and correlation coefficients. All of these measures have been discussed previously 
in the book in reference to specific tests. 

There are essentially two types of effect size indices. One type of index expresses effect 
size in the form of a correlation coefficient. This type of effect size index is computed at the con- 
clusion of an experiment to indicate the proportion of variability on the dependent variable that 
can be attributed to the independent variable. Later in the discussion it will be illustrated that 
the summary value computed for an inferential test statistic (e.g., a t, F, χ² value) can be 
transformed into a correlation coefficient, in order that the latter can be employed as a measure of 
effect size for a set of data. An example of the latter (which can be found in the first section of 
this Addendum) is the computation of the point-biserial correlation coefficient to represent 
a measure of effect size for a t value. 

A second type of effect size index is most commonly employed prior to conducting an 
experiment, in order to allow a researcher to determine the appropriate sample size to use to 
identify a hypothesized effect size. The latter type of index expresses effect size in terms of a 
difference score, which represents the difference (often in standard deviation units) between 
two underlying population parameters represented by two sample statistics, or the difference 
between a population parameter represented by a sample statistic and a hypothesized popula- 
tion parameter. Most of the effect size indices of this type were developed by Jacob Cohen, 
and are described in detail in his classic book on statistical power (Cohen (1977, 1988)). Since 
within the context of a meta-analysis there are times when a researcher may wish or be required 
to employ both types of effect size indices, equations are available for converting an effect size 
index based on a difference score into a correlational effect size index, and vice versa. Cohen 
(1977, 1988) describes the following effect size indices based on a difference score that are 
relevant to some of the inferential statistical tests discussed in this book: d, f, q, g, h, and w. 
It should be noted that although some of the aforementioned indices can be employed to meas- 
ure effect size in a study that involves two or more experimental conditions, as will be noted later 
in this section, meta-analysis is generally confined to evaluating hypotheses that contrast only 
two experimental conditions. A brief description of each of the effect size indices described by 
Cohen (1977, 1988) follows. 


The d index The d index represents the difference between two means expressed in 
standard deviation units. The d index was previously employed in the computation of the power 
of the single-sample t test (where using Equation 2.5, $d = |\mu_1 - \mu|/\sigma$) and the t tests 
for two independent and dependent samples (where using Equations 11.10 and 17.14, 
$d = |\mu_1 - \mu_2|/\sigma$ and $d = |\mu_1 - \mu_2|/\sigma_D$). Cohen (1977; 1988, Ch. 2) has derived tables 
which allow a researcher to determine, through use of the d index, the appropriate sample size 
to employ if one wants to test a hypothesis about the difference between two means at a specified 
level of power. Cohen (1977; 1988, pp. 24-27) has proposed the following (admittedly arbitrary) 
d values as criteria for identifying the magnitude of an effect size: a) A small effect size is one 
that is greater than .2 but not more than .5 standard deviation units; b) A medium effect size is 
one that is greater than .5 but not more than .8 standard deviation units; and c) A large effect size 
is greater than .8 standard deviation units. Equations 28.77 (Mullen and Rosenthal (1985), 
Rosenthal (1991, 1993)) and 28.78 (Cohen (1977; 1988, p. 23)) can be employed to convert an 
r value into a d value, and vice versa. Cohen (1977; 1988, pp. 23-27) states that the r value 
computed with Equation 28.78 is a point-biserial correlation ($r_{pb}$) when $p_1 = p_2$. When 
$p_1 \neq p_2$, Equation 28.78 becomes $r = d/\sqrt{d^2 + 1/(p_1 p_2)}$. 


$d = \frac{2r}{\sqrt{1 - r^2}}$    (Equation 28.77) 


$r = \frac{d}{\sqrt{d^2 + 4}}$    (Equation 28.78) 
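
The conversions between r and d can be sketched in Python as follows. The code assumes the usual forms of Equations 28.77 and 28.78, namely d = 2r/√(1 − r²) and r = d/√(d² + 4), together with the unequal-proportions variant r = d/√(d² + 1/(p₁p₂)). The function names, and the reading of p1 as the proportion of cases in the first group, are assumptions made for illustration.

```python
import math

def d_from_r(r):
    """Convert an r value into a d value (assumed form of Equation 28.77)."""
    return 2 * r / math.sqrt(1 - r**2)

def r_from_d(d, p1=0.5):
    """Convert a d value into an r value.

    With p1 = p2 = .5 the term 1 / (p1 * p2) equals 4, which gives the
    assumed form of Equation 28.78; other values of p1 give the variant
    for unequal proportions.
    """
    p2 = 1 - p1
    return d / math.sqrt(d**2 + 1 / (p1 * p2))

d_value = d_from_r(0.5)
print(round(d_value, 4))            # 1.1547
print(round(r_from_d(d_value), 4))  # 0.5
```

The round trip from r = .50 to d and back recovers the original r, which serves as a quick check that the two equations are inverses of one another when the two proportions are equal.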


Tatsuoka (1993) notes that Glass (1976) has developed a sample analogue of Cohen's d 
index, which is designed to serve as a measure of association when a researcher has an exper- 
imental group and a control group. Equation 28.79 is employed to compute Glass's g index. 


$g = \frac{\bar{X}_E - \bar{X}_C}{\tilde{s}_C}$    (Equation 28.79) 


In Equation 28.79, $\bar{X}_E$ and $\bar{X}_C$, respectively, represent the means of the experimental 
and control groups, and $\tilde{s}_C$ represents the estimated population standard deviation of the 
control group. Note that Glass (1976) employs the standard deviation of the control group rather 
than a pooled standard deviation involving both groups in the denominator of his equation. He 
does this since he believes that if pooled variability is employed in the denominator of the 
equation, the relevant g value for a specific experimental group and a control group will be 
unduly influenced by the variability of other experimental groups which are not involved in the 
comparison. A more detailed discussion of Glass’s g index can be found in Tatsuoka (1993). 
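
A minimal Python sketch of Glass's g index follows. The scores are hypothetical, and the function name glass_g is introduced only for illustration. The standard library function statistics.stdev divides by n − 1, so it is used here as the estimated population standard deviation of the control group.

```python
from statistics import mean, stdev

def glass_g(experimental, control):
    """Glass's g: difference between the experimental and control means,
    divided by the estimated population standard deviation of the
    control group only (no pooling)."""
    return (mean(experimental) - mean(control)) / stdev(control)

# Hypothetical scores for one experimental group and the control group.
experimental_scores = [6, 8, 10]
control_scores = [2, 4, 6]
print(glass_g(experimental_scores, control_scores))  # 2.0
```

Since only the control group's variability enters the denominator, adding further experimental groups to a study would leave this value unchanged.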


The f index The f index is a generalization of the d index to the case where there are three 
or more means. The f index was previously discussed in Section VI of the single-factor 
between-subjects analysis of variance. Cohen (1977; 1988, Ch. 8) has derived tables that allow 
a researcher to determine, through use of the f index, the appropriate sample size to employ if 
one wants to test a hypothesis about the difference between three or more means at a specified 
level of power. Cohen (1977; 1988, pp. 284-288) has proposed the following (admittedly 
arbitrary) f values as criteria for identifying the magnitude of an effect size: a) A small effect size 
is one that is greater than .1 but not more than .25; b) A medium effect size is one that is greater 
than .25 but not more than .4; and c) A large effect size is greater than .4. 

Tatsuoka (1993) notes that Hedges (1981) has developed the g' index as an alternative to 
the f index. As is the case with the latter index, g' is designed to be used as a measure of effect 
size for the analysis of variance. Hedges' g' is computed with Equation 28.80. 


$g' = \frac{\bar{X}_j - \bar{X}_C}{\sqrt{MS_{WG}}}$    (Equation 28.80) 


In Equation 28.80, $\bar{X}_j$ represents the mean of the jth group and $\bar{X}_C$ represents the mean 
of the control group. $\sqrt{MS_{WG}}$ is the square root of the pooled estimate of within-groups 
variability employed in the analysis of variance. A more detailed discussion of Hedges' g' index can 
be found in Tatsuoka (1993). 
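
Hedges' g' can be sketched in Python as follows. The scores and the function names (ms_within, hedges_g_prime) are hypothetical; the within-groups mean square is computed in the usual way, as the pooled within-groups sum of squares divided by the within-groups degrees of freedom.

```python
import math
from statistics import mean, variance

def ms_within(groups):
    """Pooled within-groups variance estimate (MS_WG) across all groups."""
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    df_within = sum(len(g) - 1 for g in groups)
    return ss_within / df_within

def hedges_g_prime(group_j, control, all_groups):
    """Difference between the mean of group j and the control mean,
    divided by the square root of MS_WG."""
    return (mean(group_j) - mean(control)) / math.sqrt(ms_within(all_groups))

# Hypothetical data: a control group and two experimental groups.
control = [2, 4, 6]
group_1 = [6, 8, 10]
group_2 = [4, 6, 8]
groups = [control, group_1, group_2]

print(hedges_g_prime(group_1, control, groups))  # 2.0
print(hedges_g_prime(group_2, control, groups))  # 1.0
```

In contrast to Glass's g, the denominator here pools the variability of every group in the analysis of variance, including groups that are not part of the comparison.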


The q index The q index represents the difference between two Pearson product-moment 
correlation coefficients, where the latter values are expressed through use of Fisher's $z_r$ 
transformation. The equation for the q index is $q = z_{r_1} - z_{r_2}$, where $z_{r_1}$ and $z_{r_2}$ are the Fisher 
transformed $z_r$ values for the two correlation coefficients. Cohen (1977; 1988, Ch. 4) has 
derived tables that allow a researcher to determine, through use of the q index, the appropriate 
sample size to employ if one wants to test a hypothesis about the difference between two 
correlation coefficients at a specified level of power. Cohen (1977; 1988, pp. 113-116) has 
proposed the following (admittedly arbitrary) q values as criteria for identifying the magnitude 
of an effect size: a) A small effect size is one that is greater than .1 but not more than .3; b) A 
medium effect size is one that is greater than .3 but not more than .5; and c) A large effect size 
is greater than .5. 
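
The q index can be computed directly, since Fisher's transformation of r is the inverse hyperbolic tangent, (1/2) ln[(1 + r)/(1 − r)]. The following Python sketch uses hypothetical correlations; the function names are introduced only for illustration.

```python
import math

def fisher_z(r):
    """Fisher's transformation of a correlation: 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)

def cohen_q(r1, r2):
    """q index: difference between two Fisher-transformed correlations."""
    return fisher_z(r1) - fisher_z(r2)

# Hypothetical correlations of .60 and .30.
print(round(cohen_q(0.60, 0.30), 4))  # 0.3836
```

By Cohen's criteria the value obtained for these two correlations falls in the medium range (greater than .3 but not more than .5).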


The g index The g index (not to be confused with Glass's g index discussed earlier) can 
be employed to compute the power of the binomial sign test for a single sample. The g index 
represents the distance in units of proportion from the value .50. The equation Cohen (1977, 
1988) employs for the g index is g = P - .50, where P represents the hypothesized value of 
the population proportion stated in the alternative hypothesis — in this instance it is assumed 
that the researcher has stated a specific value in the alternative hypothesis as an alternative to the 
value that is stipulated in the null hypothesis. Cohen (1977; 1988, Ch. 5) has derived tables that 
allow a researcher, through use of the g index, to determine the appropriate sample size to 
employ if one wants to test a hypothesis about the distance of a proportion from the value .5 at 
a specified level of power. Cohen (1977; 1988, pp. 147-150) has proposed the following 
(admittedly arbitrary) g values as criteria for identifying the magnitude of an effect size: a) A 
small effect size is one that is greater than .05 but not more than .15; b) A medium effect size 
is one that is greater than .15 but not more than .25; and c) A large effect size is greater than .25. 


The h index The h index can be employed to compute the power of the z test for two 
independent proportions (Test 16d). The value h is an effect size index reflecting the difference 
between two population proportions. h is computed through use of the arcsine transformation 
(discussed in Section VII of the t test for two independent samples). The equation for the h 
index is $h = \phi_1 - \phi_2$ (where $\phi_1$ and $\phi_2$ are the arcsine transformed values for the 
proportions). Cohen (1977; 1988, Ch. 6) has derived tables that allow a researcher, through use of the 
h index, to determine the appropriate sample size to employ if one wants to test a hypothesis 
about the difference between two population proportions at a specified level of power. Cohen 
(1977; 1988, pp. 184-185) has proposed the following (admittedly arbitrary) h values as criteria 
for identifying the magnitude of an effect size: a) A small effect size is one that is greater than 
.2 but not more than .5; b) A medium effect size is one that is greater than .5 but not more than 
.8; and c) A large effect size is greater than .8. 
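
A short Python sketch of the h index follows. It assumes the form of the arcsine transformation employed by Cohen, in which a proportion P is transformed into 2 arcsin(√P), with the angle expressed in radians. The proportions in the example and the function names are hypothetical.

```python
import math

def arcsine_transform(p):
    """Arcsine transformation of a proportion (assumed form: 2 * arcsin(sqrt(p)))."""
    return 2 * math.asin(math.sqrt(p))

def cohen_h(p1, p2):
    """h index: difference between two arcsine-transformed proportions."""
    return arcsine_transform(p1) - arcsine_transform(p2)

# Hypothetical population proportions of .65 and .45.
print(round(cohen_h(0.65, 0.45), 2))  # 0.4
```

By Cohen's criteria the value obtained for these two proportions represents a small effect size (greater than .2 but not more than .5).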


The w index The w index can be employed to compute the power of the chi-square 
goodness-of-fit test (Test 8) and the chi-square test for r x c tables. The value w is an effect 
size index reflecting the difference between expected and observed frequencies. The equation 
for the w index is $w = \sqrt{\sum_i (P_{1i} - P_{0i})^2 / P_{0i}}$. The latter equation indicates the following: 
a) For each of the cells in the chi-square table, the proportion of cases hypothesized in the null 
hypothesis is subtracted from the proportion of cases hypothesized in the alternative hypothesis; 
b) The obtained difference in each cell is squared, and then divided by the proportion 
hypothesized in the null hypothesis for that cell; c) All of the values obtained for the cells in part b) are 
summed; and d) w represents the square root of the sum obtained in part c). Cohen (1977; 1988, 
Ch. 7) has derived tables that allow a researcher to determine, through use of the w index, the 
appropriate sample size to employ if one wants to test a hypothesis about the difference between 
observed and expected frequencies in a chi-square table at a specified level of power. Cohen 
(1977; 1988, pp. 224-226) has proposed the following (admittedly arbitrary) w values as criteria 
for identifying the magnitude of an effect size: a) A small effect size is one that is greater than 
.1 but not more than .3; b) A medium effect size is one that is greater than .3 but not more than 
.5; and c) A large effect size is greater than .5. 
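
Steps a) through d) for the w index translate directly into code. In the Python sketch below, the function name cohen_w and the cell proportions are hypothetical; p_null and p_alt hold, cell by cell, the proportions hypothesized in the null and alternative hypotheses.

```python
import math

def cohen_w(p_null, p_alt):
    """w index computed from cell proportions under the null and
    alternative hypotheses."""
    total = 0.0
    for p0, p1 in zip(p_null, p_alt):
        difference = p1 - p0          # step a): alternative minus null
        total += difference**2 / p0   # steps b) and c): square, divide, sum
    return math.sqrt(total)           # step d): square root of the sum

# Hypothetical two-cell example: null proportions .5/.5, alternative .6/.4.
print(round(cohen_w([0.5, 0.5], [0.6, 0.4]), 2))  # 0.2
```

By Cohen's criteria the value obtained in this example represents a small effect size (greater than .1 but not more than .3).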

With the exception of Glass's g and Hedges’ g’, the effect size indices discussed above are 
typically employed for power computations prior to conducting an inferential statistical test. 
Consequently, it is far more common that the effect size indices employed within the framework 
of meta-analysis are correlational measures (an r value) that are based on the empirical data 
obtained in a study. At this point the use of the Pearson product-moment correlation coef- 
ficient as a measure of effect size will be discussed. 


Pearson r as a measure of effect size Although outside of the context of meta-analysis 
r² (the coefficient of determination which is discussed in Section V) rather than r is more 
commonly used to represent the measure of effect size, either of the values can be employed for 
this purpose. In discussing the use of the Pearson product-moment correlation coefficient as 
a measure of effect size, Cohen (1977; 1988, pp. 78-81) has proposed the following (admittedly 
arbitrary) r values as criteria for identifying the magnitude of an effect size: a) A small effect 
size is one that is greater than .1 but not more than .3; b) A medium effect size is one that is 
greater than .3 but not more than .5; and c) A large effect size is greater than .5. As previously 
noted in Section VI, Cohen (1977; 1988, Ch. 3) has derived tables for computing power that 
allow a researcher to determine the appropriate sample size to employ if one wants to evaluate 
an alternative hypothesis that designates a specific value for a population correlation (when the 
null hypothesis is $H_0: \rho = 0$). In addition to Pearson r, Rosenthal (1993) notes that any of the 
following measures of association (all of which are special cases of the Pearson product- 
moment correlation coefficient) can be used to represent an r value within the framework of 
a meta-analysis: a) The point-biserial correlation coefficient (discussed earlier in this 
Addendum), which is employed as a measure of association for the t test for two independent 
samples (and can also be employed as a measure of association for the t test for two dependent 
samples (Test 17)); b) The phi coefficient (φ) (which is discussed in Section VII as well as in 
Section VI of the chi-square test for r x c tables), which is employed when both variables are 
dichotomous; c) Spearman's rank-order correlation coefficient, which is employed when both 
variables are in a rank-order format. In Section VI of Spearman's rank-order correlation 
coefficient, it is demonstrated that the latter correlation coefficient is a special case of the Pearson 
product-moment correlation coefficient. 
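
The statement that the phi coefficient is a special case of the Pearson product-moment correlation coefficient can be checked numerically. The Python sketch below computes Pearson r on two dichotomous variables coded 0/1, and compares it with phi computed from the cell frequencies of the corresponding 2 x 2 table using the standard shortcut formula (ad − bc)/√[(a + b)(c + d)(a + c)(b + d)]. The table and the function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation computed from raw scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def phi_from_table(a, b, c, d):
    """Phi coefficient from the four cell frequencies of a 2 x 2 table."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical 2 x 2 table: rows are X = 1 and X = 0, columns are Y = 1 and Y = 0.
a, b, c, d = 30, 10, 10, 30
x = [1] * (a + b) + [0] * (c + d)
y = [1] * a + [0] * b + [1] * c + [0] * d

print(pearson_r(x, y), phi_from_table(a, b, c, d))  # 0.5 0.5
```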


Meta-analytic procedures Rosenthal (1991, 1993) describes the following four types 
of meta-analytic procedures (as well as additional procedures that will not be covered): a) Pro- 
cedures that compare two or more studies with respect to significance level. In the case of two 
studies, these procedures determine whether or not the p values obtained for two studies are 
significantly different from one another. In the case of three or more studies, the meta-analytic 
procedures determine whether the p values for the k studies (where k represents the number of 
studies) are homogeneous — i.e., consistent with one another; b) Procedures that combine the 
significance levels (i.e., p values) of two or more studies, and obtain a combined/pooled estimate 
of the p value for all k studies; c) Procedures that compare two or more studies with respect to 
effect size. In the case of two studies, these procedures determine whether or not the computed 
values for the effect sizes of the two studies are significantly different from one another. In the 
case of three or more studies, the meta-analytic procedures determine whether the computed 
values for the effect sizes of the k studies are homogeneous — i.e., consistent with one another; 
and d) Procedures that combine the effect size values computed for two or more studies, and 
obtain a combined/pooled estimate of effect size for all k studies. 

It should be noted that when the results of two or more studies are compared or combined 
within the framework of a meta-analysis, it is assumed that the k studies are independent of one 
another (i.e., represent separate studies employing different subjects). To carry out the 
procedures to be described in this section, it is required that the test statistic representing the 
outcome of each of the k studies is standardized (i.e., that the result of each study is summarized 
by the same test statistic). The most common statistics employed for this purpose are values of 
z and r. In meta-analytic procedures that compare or combine significance levels, p values are 
converted into z values. In procedures that compare or combine effect sizes, the effect size is 
commonly expressed as an r value. At this point I will summarize a number of equations 
(described in Rosenthal (1985, 1991, 1993)) that allow a researcher to convert the results of an 
inferential statistical test into an r value. 
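
The conversion of a p value into a z value noted above can be sketched with the Python standard library. The sketch assumes that a one-tailed p value is converted into the standard normal deviate that cuts off that proportion in the upper tail of the normal distribution; the function name is hypothetical.

```python
from statistics import NormalDist

def z_from_one_tailed_p(p):
    """Standard normal deviate with upper-tail probability p."""
    return NormalDist().inv_cdf(1 - p)

for p in (0.05, 0.01, 0.10):
    print(p, round(z_from_one_tailed_p(p), 3))
# 0.05 1.645
# 0.01 2.326
# 0.1 1.282
```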

a) Within the framework of conducting a t test for two independent samples or a t test 
for two dependent samples, Equation 28.81 can be employed to transform a t value into an r 
value. 


$r = \sqrt{\frac{t^2}{t^2 + df}}$    (Equation 28.81) 


The value computed with Equation 28.81 is the square root of the value computed with 
Equation 28.45 (the equation for computing eta squared (η²), which is equivalent to the square 
of the point-biserial correlation ($r_{pb}$)). It should also be noted that within the framework of 
conducting a t test, Equation 28.78 was presented earlier for conversion of Cohen's d index into 
an r value. 

b) Within the framework of conducting an analysis of variance where the degrees of 
freedom for the numerator equals 1 (i.e., two groups/conditions), Equation 28.82 can be 
employed to transform an F value into an r value. 


$r = \sqrt{\frac{F}{F + df_{error}}}$    (Equation 28.82) 


c) Within the framework of conducting a chi-square test for r x c tables, where df = 1 
(i.e., a 2 x 2 contingency table), Equation 28.83 can be employed to transform a chi-square value 
into an r value. 


$r = \sqrt{\frac{\chi^2}{N}}$    (Equation 28.83) 


Note that Equation 28.83 is the same as Equation 16.18, which is employed to compute the 
phi coefficient (φ). 

d) If a researcher wants to transform a z value into an r value, Equation 28.84 can be 
employed. 


$r = \sqrt{\frac{z^2}{N}}$    (Equation 28.84) 
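
The four conversions can be collected in a short Python sketch. It assumes the forms of these equations commonly given by Rosenthal: r = √[t²/(t² + df)], r = √[F/(F + df_error)] for a numerator df of 1, r = √(χ²/N) for df = 1, and r = √(z²/N), where N is taken to be the total number of observations. The function names and the illustrative values (other than the chi-square value from Table 16.2) are hypothetical.

```python
import math

def r_from_t(t, df):
    """r from a t value and its degrees of freedom."""
    return math.sqrt(t**2 / (t**2 + df))

def r_from_f(f, df_error):
    """r from an F value with 1 numerator df and df_error denominator df."""
    return math.sqrt(f / (f + df_error))

def r_from_chi_square(chi_square, n):
    """r (phi) from a chi-square value with df = 1 and total sample size n."""
    return math.sqrt(chi_square / n)

def r_from_z(z, n):
    """r from a z value and total sample size n."""
    return math.sqrt(z**2 / n)

# With two groups F = t squared, so t = 2 and F = 4 give the same r.
print(round(r_from_t(2.0, 16), 4), round(r_from_f(4.0, 16), 4))  # 0.4472 0.4472

# Chi-square value and sample size reported earlier for Table 16.2.
print(round(r_from_chi_square(18.18, 200), 3))  # 0.302
```

Each function returns the magnitude of r; the direction of the effect has to be supplied by the researcher from the pattern of the means or cell frequencies.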


Rosenthal (1991, 1993) notes that within the context of meta-analysis, he is only interested 
in single degree of freedom comparisons — which is the term he uses to refer to studies that 
compare two groups/conditions with one another. He states that meta-analytic procedures are 
of little use when df > 1, since, when an omnibus test (e.g., an analysis of variance, chi-square 
test, etc.) compares more than two groups, it becomes difficult or impossible to answer the 
questions addressed by meta-analysis with a high degree of precision. In other words, an 
omnibus test statistic based on more than two experimental conditions does not identify which 
of the conditions are significantly different from one another. The use of the term single 
degree of freedom comparison within the context of a two group/condition experiment refers 
to the following: a) When there are two groups/conditions, the between-groups/between- 
conditions degrees of freedom for the analysis of variance is equal to 1. When $df_{BG/BC} = 1$, 
the analysis of variance and the t tests for two independent and dependent samples are 
equivalent procedures; and b) In the case of the chi-square test for r x c tables, df = 1 when two 
groups are contrasted with one another. 


Demonstration of meta-analytic procedures In this section the following four meta- 
analytic procedures described by Rosenthal (1985, 1991, 1993) will be presented: a) A 
procedure for comparing k studies with respect to homogeneity of significance level; b) A 
procedure for obtaining a combined significance level (p value) for k studies; c) A procedure for 
comparing k studies with respect to homogeneity of effect size; and d) A procedure for obtaining 
a combined effect size for k studies. 

Example 28.7 will be employed to demonstrate the aforementioned meta-analytic procedures. 


Example 28.7 Five independent studies (to be identified by the letters A, B, C, D, and E) 
evaluating the same general hypothesis (e.g., patients who receive a specific type of therapy will 
do better than a no-treatment control group) are conducted over a two year period. All of the 
studies employ an independent groups design with an independent variable comprised of two 
levels. The analysis of the data in each of the studies involved the use of the t test for two 
independent samples to determine if there was a difference between the means of an experi- 
mental and control group. In addition, a point-biserial correlation (which will be represented 
by the notation r) was computed for each study, to determine the magnitude of any effect size that 
was present. In studies A, B, C, and D the mean score of the experimental group was higher 
than the mean score of the control group. The one-tailed probability values (based on the result 
of the t test) and the point-biserial correlations computed for the studies follow: A (p = .05; 
r = .60); B (p = .01; r = .50); C (p = .10; r = .20); D (p = .20; r = .30). The result of study E was 
in the opposite direction of the other four studies, i.e., the mean of the control group was 
higher than the mean of the experimental group. The one-tailed probability value (based on the 
result of the t test) and the point-biserial correlation for study E follow: E (p = .09; r = .15). 
The total numbers of subjects employed in the studies were: A (20); B (40); C (10); D (30); 
E (25). Compare the p values and effect sizes with respect to homogeneity, and compute a 
combined/pooled p value and effect size for the five studies. 


At this point it should be stated that all probability values (p) used within the framework of 
the meta-analytical procedures to be described will be one-tailed. The reason for this is that we 
want to be able to designate the direction of the outcome of each study — i.e., whether the mean 
of the experimental group is larger than the mean of the control group, or vice versa. Thus, the 
left tail of the distribution will represent one directional outcome and the right tail the other 
directional outcome. Numerically, the direction of an outcome will be designated in reference 
to a summary statistic (e.g., an r or z value) by assigning a plus sign to outcomes in one tail of 
the sampling distribution, and a minus sign to outcomes in the other tail. 


Test 28n: Procedure for comparing k studies with respect to significance level The 
procedure to be described in this section evaluates the following null and alternative hypotheses. 


Null hypothesis H₀: The p values obtained for the k studies are consistent/ 
homogeneous with one another. 


Alternative hypothesis H₁: The p values obtained for the k studies are not consistent/ 
homogeneous with one another. 


Equation 28.85 can be employed to evaluate whether k (where k ≥ 2) probability (p) values 
are homogeneous (i.e., consistent with one another). When the test to be described is employed 
with k = 2 studies, it simply evaluates whether there is a significant difference between the two 
p values. 


\[ \chi^2 = \sum_{j=1}^{k} (z_j - \bar{z})^2 \]    (Equation 28.85)


Where: z̄ represents the average z value computed for the k published studies 
       z_j represents the z value for the jth study 


In order to employ Equation 28.85, it is required that each of the p values is converted into 
its corresponding standard normal deviate (i.e., a z value). In order to do the latter, we find the 
p value in Column 3 of Table A1 in the Appendix which corresponds to the p value obtained 
for a given study (note that the p values in Column 3 represent one-tailed probabilities). The z 
value in the same row (i.e., the value in Column 1) of Table A1 that corresponds to the latter p 
value is employed as its standard normal deviate. Thus, in the case of Study A, which has a p 
value of .05, the value z = 1.65 is employed to represent it, since the probability/proportion in 
Column 3 of Table A1 is .0495 (which is the closest value to .05). In the case of Study B, which 
has a p value of .01, the value z = 2.33 is employed to represent it, since the 
probability/proportion in Column 3 of Table A1 is .0099 (which is the closest value to .01). 
Employing the same methodology with Studies C and D, we find that the z values which 
correspond to the p values .10 and .20 are 1.28 and .84. In the case of Study E, the z value that 
corresponds to the p value .09 is 1.34. The z values for Studies A, B, C, and D (the outcomes of 
which are in the same direction) will all be assigned a positive sign. The z value for Study E (the 
outcome of which is in the opposite direction of the other studies) will be assigned a negative 
sign. Thus, the five z values we will employ in Equation 28.85 are 1.65, 2.33, 1.28, .84, and 
-1.34. 
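
The table lookup described above can also be carried out numerically. The following is a minimal sketch, assuming Python's standard library class statistics.NormalDist; the variable names are illustrative and are not part of the original procedure. Exact inverse-normal values differ slightly from the rounded Table A1 entries (e.g., about 1.645 rather than 1.65 for p = .05).

```python
from statistics import NormalDist

# One-tailed p values and outcome directions for Example 28.7.
# direction = +1 if the experimental group had the higher mean, -1 otherwise.
studies = {
    "A": (0.05, +1),
    "B": (0.01, +1),
    "C": (0.10, +1),
    "D": (0.20, +1),
    "E": (0.09, -1),
}

std_normal = NormalDist()  # mean 0, standard deviation 1

# z is the standard normal deviate that cuts off an upper tail of area p;
# the sign records the direction of the study's outcome.
z_values = {
    name: direction * std_normal.inv_cdf(1 - p)
    for name, (p, direction) in studies.items()
}

for name, z in z_values.items():
    print(f"Study {name}: z = {z:.3f}")
```

The later computations in the text use the two-decimal table values, so small differences in the third decimal place are to be expected.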

The following protocol is employed for Equation 28.85: a) Compute the mean of the k 
z_j values; b) Subtract the mean from each of the k z_j values, and square each difference score; 
and c) Compute the sum of the k squared difference scores. The resulting value, which is a chi- 
square value, represents the test statistic, for which the degrees of freedom are df = k - 1. 

We compute the average of the five z values to be z̄ = .95. When the latter value along 
with the five z scores that are computed above are substituted in Equation 28.85, we obtain the 
value χ² = 7.75. 


\[ \chi^2 = (1.65 - .95)^2 + (2.33 - .95)^2 + (1.28 - .95)^2 + (.84 - .95)^2 + (-1.34 - .95)^2 = 7.75 \]

Since there are 5 studies, df = 5 - 1 = 4. Employing Table A4 in the Appendix, for df = 
4, χ²_.05 = 9.49 and χ²_.01 = 13.28 (the probabilities for these critical values are one-tailed). In 
order to reject the null hypothesis, the computed chi-square value must be equal to or greater than 
the tabled critical value at the prespecified level of significance. Since the obtained value 
χ² = 7.75 is less than χ²_.05 = 9.49, the null hypothesis is retained. In other words, the data 
do not indicate that the probability values obtained for the five studies are inconsistent (i.e., not 
homogeneous) with one another. 
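
The protocol for Test 28n can be sketched in a few lines. This is an illustrative sketch only: the variable names are assumptions, and the critical value 9.49 is simply the tabled value quoted above rather than something computed. Because the mean is carried at full precision, the statistic comes out near 7.76, marginally different from the 7.75 reported above, presumably owing to rounding.

```python
# z values for Studies A-E of Example 28.7, as read from Table A1 in the text.
z_values = [1.65, 2.33, 1.28, 0.84, -1.34]

k = len(z_values)
z_mean = sum(z_values) / k

# Equation 28.85: sum of squared deviations of the z values from their mean.
chi_square = sum((z - z_mean) ** 2 for z in z_values)
df = k - 1

# Tabled critical chi-square value for df = 4 at the .05 level (from the text).
CRITICAL_05 = 9.49
reject_null = chi_square >= CRITICAL_05

print(f"mean z = {z_mean:.3f}, chi-square = {chi_square:.2f}, df = {df}")
print("reject H0" if reject_null else "retain H0")
```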


Test 28o: The Stouffer procedure for obtaining a combined significance level (p value) 
for k studies A number of procedures have been developed for obtaining a combined/pooled 
p value for k independent studies that evaluate the same general hypothesis. These procedures 
are relevant for obtaining a combined probability for studies that involve a directional hypothesis 
testing situation where df= 1 (i.e., a study in which two groups/conditions are contrasted with 
one another). Birnbaum (1954) and Rosenthal (1978, 1991) provide a good overview of the 
various procedures. The specific test to be described here, which was developed by Stouffer et 
al. (1949), computes a combined p value through use of Equation 28.86. Sources that discuss 
this test in greater detail are Conover (1999), Mosteller and Bush (1954), Rosenthal (1991, 1993), 
and Wolf (1986). 





\[ z = \frac{\sum_{j=1}^{k} z_j}{\sqrt{k}} \]    (Equation 28.86)


Where: z_j represents the z value for the jth study 


As is the case with Equation 28.85 (employed for Test 28n), Equation 28.86 requires that 
we convert each of the p values obtained for the k studies into its corresponding standard normal 
deviate (i.e., a z score). Once again it should be emphasized that one-tailed probabilities are 
always employed. Since we have already computed the appropriate z values for Test 28n, we 
are ready to employ Equation 28.86. The protocol for the equation requires that we sum the k z_j 
values, and divide the sum by the square root of k. The resulting z value represents the test 
statistic, which is evaluated with Table A1. 
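
The protocol just described can be sketched as follows for the five z values of Example 28.7. This is a minimal illustration, assuming Python's standard library statistics.NormalDist in place of the Table A1 lookup; the variable names are not from the original text.

```python
from math import sqrt
from statistics import NormalDist

# Signed z values for Studies A-E of Example 28.7 (Study E is negative
# because its outcome is in the opposite direction).
z_values = [1.65, 2.33, 1.28, 0.84, -1.34]
k = len(z_values)

# Equation 28.86: sum of the z values divided by the square root of k.
combined_z = sum(z_values) / sqrt(k)

# One-tailed probability associated with the combined z value.
combined_p = 1 - NormalDist().cdf(combined_z)

print(f"combined z = {combined_z:.2f}, one-tailed p = {combined_p:.4f}")
```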

When the z scores for Example 28.7 that correspond to the probability values obtained for 
the k = 5 studies are substituted in Equation 28.86, we obtain z = 2.13. 


\[ z = \frac{1.65 + 2.33 + 1.28 + .84 + (-1.34)}{\sqrt{5}} = \frac{4.76}{\sqrt{5}} = 2.13 \]





Employing Table A1, we determine that the one-tailed probability associated with z = 2.13 
is .0166. The latter value represents the combined probability for the five studies. Since p = .0166 
is less than p = .05, the combined probability derived from the test is statistically significant. The 
combined probability value of .0166 is an overall probability in favor of the outcome of the 
majority of the studies (in which the experimental group obtained a higher mean than the control 
group). It is important to note that in Example 28.7, the outcomes of the five studies (reflected 
by the five p values) appear to be reasonably consistent (although some might consider the con- 
trary outcome of Study E to be somewhat problematical). As will be noted later, when Equation 
28.86 yields a combined probability value that is significant, yet the outcomes of the studies are 
not homogeneous, the obtained combined probability must be viewed with great caution. 


The file drawer problem Rosenthal (1979) employs the term file drawer problem to 
refer to the fact that there may be additional studies which evaluated the same hypothesis 
evaluated in a meta-analysis, yet were never submitted for publication or were submitted but 
rejected. Consequently, if one computes a statistically significant combined probability value 
for k studies with Equation 28.86, the following question might be asked: If, in fact, the null 
hypothesis is true, how many additional null studies would have to be conducted in order to 
render the combined probability value nonsignificant? A null study is one in which there is no 
difference between the two groups and, consequently, the values z = 0 and p = .50 will be 
obtained. Rosenthal (1979) derived Equation 28.87 from Equation 28.86 to answer the latter 
question. Equation 28.87 calculates the number of studies averaging null results that must be in 
the file drawer in order to increase the Type I error rate (i.e., combined p value) so that it equals 
a specific value (typically 5% or greater). 


\[ X = \frac{k}{z_{\alpha}^2} \left[ k\bar{z}^2 - z_{\alpha}^2 \right] \]    (Equation 28.87)


Where: X represents the number of additional studies that are required to render the 
       combined probability nonsignificant 
       z_α represents the critical one-tailed z value at the required level of statistical 
       significance for the combined probability 
       z̄ represents the average z value computed for the k published studies 


Rosenthal (1979) notes that if we employ the .05 level of significance, the one-tailed value 
z_.05 = 1.645 (which for purposes of greater precision is used rather than the usual z = 1.65) 
is employed to represent z_α in Equation 28.87. When z_α = 1.645 is substituted in the latter 
equation, it becomes Equation 28.88. 


\[ X = \frac{k}{2.706} \left[ k\bar{z}^2 - 2.706 \right] \]    (Equation 28.88)


By employing Equation 28.88 one can determine how many additional null studies will be 
required in order for the combined p value for a hypothesis under study to equal .05. One 
additional null study above the computed value of X will render the combined probability above 
.05, and thus the combined probability for the general hypothesis will no longer be statistically 
significant. In the case of Example 28.7, if we substitute the values k = 5 and z̄ = .95 in 
Equation 28.88, we compute the value X = 3.34. 


\[ X = \frac{5}{2.706} \left[ (5)(.95)^2 - 2.706 \right] = 3.34 \]


This result tells us that only four additional null studies evaluating the same general 
hypothesis are required to be in the file drawer to produce a nonsignificant combined probability 
(since 4 is the next integer number above the obtained value X = 3.34). If the results of four such 
studies are combined with the five studies documented in Example 28.7, the resulting probability 
computed with Equation 28.86 will be greater than .05. Specifically, if four p values of .50 and 
their corresponding z values of zero are added to Example 28.7, Equation 28.86 will yield the 
value z = 1.59, which is less than the tabled critical .05 value z_.05 = 1.65. Note that the sum of 
the z values, which equals 4.76, does not change when k = 9 (which represents the original five 
studies plus the four null studies), since z = 0 for each of the four additional studies. Thus: 
z = 4.76/√9 = 1.59. Rosenthal (1991, 1993) has addressed the question of how one might go 
about estimating the number of unpublished studies that remain in the file drawer. He suggests 
a conservative estimate of the upper limit of the number of unpublished studies might be 
approximated by the value 5k + 10 (e.g., if k = 5, the minimum estimate for the number of studies 
in the file drawer will be (5)(5) + 10 = 35). 
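
The file drawer computation above can be checked with a short sketch. The function name file_drawer_x and the variable names are hypothetical; the formula is Equation 28.87 as given above, and the rounded mean z̄ = .95 is used to match the text.

```python
from math import floor, sqrt

def file_drawer_x(k, z_mean, z_alpha=1.645):
    """Equation 28.87: number of null studies that brings the combined p to alpha."""
    return (k / z_alpha ** 2) * (k * z_mean ** 2 - z_alpha ** 2)

k = 5
z_mean = 0.95  # rounded mean z value used in the text
x = file_drawer_x(k, z_mean)

# The next integer above X is the number of null studies that pushes the
# combined probability above .05.
null_studies_needed = floor(x) + 1

# Re-apply Equation 28.86 with the null studies added; each contributes z = 0,
# so the sum of the z values stays at 4.76 while k grows.
sum_z = 4.76
combined_z = sum_z / sqrt(k + null_studies_needed)

# Rosenthal's suggested estimate 5k + 10 for the number of unpublished studies.
file_drawer_estimate = 5 * k + 10

print(f"X = {x:.2f}, null studies needed = {null_studies_needed}")
print(f"combined z with null studies = {combined_z:.2f}")
print(f"5k + 10 = {file_drawer_estimate}")
```

Here the number of null studies needed (4) is far smaller than the 5k + 10 estimate (35), which is one way of seeing how fragile the combined probability for Example 28.7 is.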

The file drawer problem is most commonly discussed within the context of highly 
significant meta-analytic research — in other words, in instances where the combined probability 
for a hypothesis is at a statistically significant level, with the value of p being very low. In such 
an instance, a skeptic would want to rule out the likelihood that the null hypothesis is, in fact, 
true, and that many of the published studies, in reality, represent Type I errors (i.e., spuriously 
significant results). Since there is a bias toward publishing (as well as submitting) significant 
results, it is sometimes suggested that if all the unpublished studies in the file drawer were taken 
into account, support for many a hypothesis would evaporate. Rosenthal (1991) discusses the 
latter issue, as well as empirical studies that address the question of how publication bias can 
influence the results of a meta-analysis. 

At this point in the discussion, it is worth noting that it is possible to have a set of k studies 
and obtain a significant result for Test 28n, and also obtain a significant combined probability 
value for Test 28o. As an example, let us assume that the outcomes of the five studies in 
Example 28.7 were such that three favored the experimental group and two favored the control 
group (in terms of which group obtained the higher mean). Let us also assume the following: 
a) The p values for the three studies which favored the experimental group are quite low (e.g., 
in the .001 range); and b) In the case of the other two studies, the results of both are significant 
at the .05 level in favor of the control group. 

It is likely that if Test 28n is employed to evaluate the probabilities for the aforementioned 
five studies, we will conclude that the results of the studies are not homogeneous (i.e., a significant 
chi-square value will be computed with Equation 28.85). Nevertheless, it is conceivable 
that when the five probabilities are evaluated with Test 28o, a combined p value below .05 (in 
favor of the experimental group) will be obtained. Under such circumstances the combined 
probability value would have to be viewed with even greater caution than the combined 
probability of p = .0166, which was obtained through use of Equation 28.86 for Example 28.7. 
This is the case, since if all five studies are statistically significant, but three are in one direction 
and two are in the opposite direction, the consistency of the outcomes of the five studies leaves 
a lot to be desired. The lack of consistency in the probability values can be the result of any of 
the following: a) Differential effect sizes being present in the k studies; b) Differences in the size 
of the samples employed (which influence the power of each of the k tests); c) Differences in 
methodology; d) Errors in instrumentation or recording; e) Faulty data analysis; or f) Some 
combination of one or more of the aforementioned factors. In the discussion to follow, it will 
be emphasized that in order for a statistically significant combined probability value to be 
meaningful, there should be sufficient evidence it reflects the fact that the k studies employed in 
a meta-analysis yielded relatively homogeneous results. The studies should not only exhibit 
consistency with regard to the direction of their outcome, but more importantly should exhibit 
consistency with respect to the magnitude of the effect size present in the k studies. 

Neither of the two analyses conducted up to this point (i.e., Tests 28n and 28o) provides us 
with any information regarding effect size for Example 28.7. Before proceeding, it is important 
to reiterate that the p value obtained for any study is always a direct function of the power of the 
statistical test, and power is a direct function of the sample size employed in a study. In order 
to reject the null hypothesis if a small (or even modest medium) effect size is present, it will be 
required that the researcher employ a large sample size. To illustrate this point, assume that two 
studies are conducted involving two independent samples which evaluate the same hypothesis. 
Let us also assume that in the underlying populations a small effect size characterizes the 
relationship between the independent and dependent variables. One study (A) employs a large 
sample size, while the other study (B) employs a relatively small sample size. In the case of 
Study A, it is very likely that we will be able to reject the null hypothesis, and the larger the 
sample size the lower the p value that will be obtained for that study. In the case of Study B, we 
will probably not be able to reject the null hypothesis, since the small sample size will severely 
compromise the power of the test — i.e., the test's ability to detect a difference between the 
groups. The computed p value for Study B will most likely be above .05, and, in fact, may be 
considerably larger. Yet if we compute the effect size for both studies, we obtain the same value. 
Let us assume that the latter value is computed to be r = .15, which by Cohen’s (1977, 1988) 
standards constitutes a small effect size. Let us also assume that additional studies of the same 
hypothesis consistently obtain approximately the same effect size, but only a few of them — 
specifically, those which happen to employ a large sample size — yield significant results. 
Obviously, such an occurrence provides support for Hedges' (1995) and Rosenthal's (1991, 
1993) contention, that within the framework of meta-analysis, it is more prudent to compare 
and/or combine effect sizes than it is to compare and/or combine p values. The next two 
procedures to be presented are designed to do just that. 
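
The dependence of the p value on sample size when the effect size is held constant can be illustrated with a rough sketch. It assumes the large-sample relation z ≈ r√n (the inverse of the z-to-r conversion in Equation 28.84) rather than an exact t test, and the two sample sizes are hypothetical values chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_one_tailed_p(r, n):
    """Approximate one-tailed p value for effect size r and total sample size n.

    Uses the large-sample approximation z = r * sqrt(n); this is an
    illustrative assumption, not an exact test.
    """
    z = r * sqrt(n)
    return 1 - NormalDist().cdf(z)

r = 0.15        # the same small effect size in both studies
n_large = 400   # hypothetical large-sample study
n_small = 20    # hypothetical small-sample study

p_large = approx_one_tailed_p(r, n_large)
p_small = approx_one_tailed_p(r, n_small)

print(f"n = {n_large}: approximate one-tailed p = {p_large:.4f}")
print(f"n = {n_small}: approximate one-tailed p = {p_small:.4f}")
```

Under these assumptions the large study is significant at the .05 level and the small study is not, even though r = .15 in both.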


Test 28p: Procedure for comparing k studies with respect to effect size The procedure 
to be described in this section evaluates the following null and alternative hypotheses. 


Null hypothesis H₀: The effect size values obtained for the k studies are 
consistent/homogeneous with one another. 


Alternative hypothesis H₁: The effect size values obtained for the k studies are not 
consistent/homogeneous with one another. 


Equation 28.89 can be employed to evaluate whether k (where k ≥ 2) effect size values (as 
measured by r) are homogeneous (i.e., consistent with one another). (Equation 28.24 in Section 
VI is a different but equivalent version of Equation 28.89.) When the test to be described is 
employed with k = 2 studies, it simply evaluates whether there is a significant difference between 
the two effect size values. 


\[ \chi^2 = \sum_{j=1}^{k} (n_j - 3)(z_{r_j} - \bar{z}_r)^2 \]    (Equation 28.89)


Where: z_rj represents the Fisher transformed z_r value for the jth study 
       z̄_r represents the average Fisher transformed z_r value computed for the k published 
       studies. It is a weighted average, since the sample size of each study is taken into 
       account. The value of z̄_r is computed with Equation 28.25. 
       n_j represents the sample size in the jth study 


In order to employ Equation 28.89 (or Equation 28.24), it is required that the r value for 
each study (which, as noted earlier, is used to represent the magnitude of effect size) is converted 
into a corresponding Fisher transformed value (discussed in Section VI). The latter is accomplished 
through use of either Equation 28.18 or Table A17. Employing Table A17, the Fisher 
transformed z_r values for the five studies are as follows. 


z_rA = .693    z_rB = .549    z_rC = .203    z_rD = .310    z_rE = .151 


We will assign the Fisher transformed z_r values for Studies A, B, C, and D a positive sign, 
since they are all in the direction that favors the experimental group. Since the outcome of Study 
E is in the opposite direction, it is assigned a negative sign. Thus, for Study E we will use the 
value z_rE = -.151. 

Employing Equation 28.25, the average of the five Fisher transformed z_r values is 
computed to be z̄_r = .351. 


\[ \bar{z}_r = \frac{\sum_{j=1}^{k} (n_j - 3) z_{r_j}}{\sum_{j=1}^{k} (n_j - 3)} \]

\[ \bar{z}_r = \frac{(20-3)(.693) + (40-3)(.549) + (10-3)(.203) + (30-3)(.310) + (25-3)(-.151)}{(20-3) + (40-3) + (10-3) + (30-3) + (25-3)} = .351 \]
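
The Fisher transformation and the weighted average can also be obtained without Table A17. The following minimal sketch assumes that the Fisher transformation is the inverse hyperbolic tangent of r (math.atanh in Python's standard library); the dictionary layout and variable names are illustrative.

```python
from math import atanh

# (r, n, direction) for Studies A-E of Example 28.7.
# direction = +1 if the outcome favors the experimental group, -1 otherwise.
studies = {
    "A": (0.60, 20, +1),
    "B": (0.50, 40, +1),
    "C": (0.20, 10, +1),
    "D": (0.30, 30, +1),
    "E": (0.15, 25, -1),
}

# Fisher transformed value for each study, carrying the sign of the outcome.
fisher_z = {name: d * atanh(r) for name, (r, n, d) in studies.items()}

# Each study is weighted by its sample size less three.
weights = {name: n - 3 for name, (r, n, d) in studies.items()}

# Weighted average of the Fisher transformed values.
weighted_mean = (
    sum(weights[name] * fisher_z[name] for name in studies)
    / sum(weights.values())
)

for name in studies:
    print(f"Study {name}: z_r = {fisher_z[name]:.3f}, weight = {weights[name]}")
print(f"weighted mean z_r = {weighted_mean:.3f}")
```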


We are now ready to substitute the appropriate values in Equation 28.89. The following 
protocol is employed in using the latter equation: a) For each of the k studies, subtract the average 
Fisher transformed z_r value (i.e., z̄_r = .351) from the Fisher transformed z_r value for that 
study. Square the difference score, and multiply that value by the total sample size less three; and 
b) The sum of the k values computed in part a) is a chi-square value, which represents the test 
statistic. The degrees of freedom employed in evaluating the chi-square value are df = k - 1. 
The computations employing Equation 28.89 are shown below, which yield the value χ² = 9.18. 


\[ \chi^2 = (20 - 3)(.693 - .351)^2 + (40 - 3)(.549 - .351)^2 + (10 - 3)(.203 - .351)^2 + (30 - 3)(.310 - .351)^2 + (25 - 3)(-.151 - .351)^2 = 9.18 \]


Since there are 5 studies, df = 5 - 1 = 4. Employing Table A4, for df = 4, χ²_.05 = 9.49 
and χ²_.01 = 13.28 (the probabilities for these critical values are one-tailed). In order to reject 
the null hypothesis, the computed chi-square value must be equal to or greater than the tabled 
critical value at the prespecified level of significance. Since the obtained value χ² = 9.18 is 
less (albeit barely) than χ²_.05 = 9.49, the null hypothesis is retained. In other words, the data do 
not indicate that the effect sizes obtained for the five studies are inconsistent with one another. 
Realistically, however, since the outcome is so close to being significant, most researchers would 
probably be reluctant to accept that a homogeneous effect size has been demonstrated. Certainly, 
visual inspection of the r values does not suggest homogeneity. 
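
The computation for Test 28p can be reproduced with the sketch below. It uses the Table A17 values and sample sizes quoted in the text; the variable names are illustrative, and the critical value 9.49 is taken from the text rather than computed.

```python
# Signed Fisher transformed values and sample sizes for Studies A-E.
fisher_z = [0.693, 0.549, 0.203, 0.310, -0.151]
sample_sizes = [20, 40, 10, 30, 25]

# Each study is weighted by its sample size less three.
weights = [n - 3 for n in sample_sizes]

# Weighted average of the Fisher transformed values.
z_bar = sum(w * z for w, z in zip(weights, fisher_z)) / sum(weights)

# Equation 28.89: weighted sum of squared deviations from the weighted mean.
chi_square = sum(w * (z - z_bar) ** 2 for w, z in zip(weights, fisher_z))
df = len(fisher_z) - 1

# Tabled critical chi-square value for df = 4 at the .05 level (from the text).
CRITICAL_05 = 9.49
reject_null = chi_square >= CRITICAL_05

print(f"weighted mean z_r = {z_bar:.3f}")
print(f"chi-square = {chi_square:.2f}, df = {df}")
print("reject H0" if reject_null else "retain H0")
```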


Test 28q: Procedure for obtaining a combined effect size for k studies Rosenthal 
(1993) employs Equation 28.90 to compute a combined/pooled effect size for k studies. 


\[ \bar{z}_r = \frac{\sum_{j=1}^{k} z_{r_j}}{k} \]    (Equation 28.90)


Where: z_rj represents the Fisher transformed z_r value for the jth study 
       z̄_r represents the average Fisher transformed z_r value computed for the k published 
       studies 


The following protocol is employed in using Equation 28.90: a) Obtain the sum of the 
Fisher transformed z_r values for the k studies, and divide that sum by k; and b) Employing Table 
A17, convert the value computed for z̄_r into its corresponding r value. The latter value represents 
the combined effect size for the k studies. Employing part a) of the aforementioned 
protocol, through use of Equation 28.90 the value z̄_r = .321 is computed. Employing Table A17 
the latter value is converted into r = .31, which is the combined/pooled effect size. 


\[ \bar{z}_r = \frac{.693 + .549 + .203 + .310 + (-.151)}{5} = \frac{1.604}{5} = .321 \]


Rosenthal (1993) notes that, if one wished to weight studies on the basis of their sample 
size, the value z̄_r = .351 computed with Equation 28.25 in the previous section can be used to 
represent the average Fisher transformed z_r value (he also states weighting can be done on the 
basis of other criteria, such as the relative quality of the studies). The r value that corresponds 
to the Fisher transformed value z̄_r = .351 is r = .34, which will represent the combined/ 
pooled effect size. Using Cohen's (1977, 1988) criteria (for an r value), the unweighted value 
r = .31 and the weighted value r = .34 both fall at the lower bound of the range for a medium 
effect size (the effect size criteria for an r value are noted on page 838). 
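
Both the unweighted and the weighted combined effect sizes can be reproduced with the sketch below. It assumes that converting a Fisher transformed value back to r is the hyperbolic tangent (math.tanh), which stands in for the Table A17 lookup; the variable names are illustrative.

```python
from math import tanh

# Signed Fisher transformed values and sample sizes for Studies A-E.
fisher_z = [0.693, 0.549, 0.203, 0.310, -0.151]
sample_sizes = [20, 40, 10, 30, 25]
k = len(fisher_z)

# Equation 28.90: unweighted mean of the Fisher transformed values.
z_bar_unweighted = sum(fisher_z) / k

# Equation 28.25: mean weighted by sample size less three.
weights = [n - 3 for n in sample_sizes]
z_bar_weighted = sum(w * z for w, z in zip(weights, fisher_z)) / sum(weights)

# Convert each mean back to the r metric.
r_unweighted = tanh(z_bar_unweighted)
r_weighted = tanh(z_bar_weighted)

print(f"unweighted: mean z_r = {z_bar_unweighted:.3f}, r = {r_unweighted:.2f}")
print(f"weighted:   mean z_r = {z_bar_weighted:.3f}, r = {r_weighted:.2f}")
```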

Rosenthal (1993) states that one should view a combined effect size value with extreme 
caution when there is reason to believe that the k effect sizes employed in determining it were 
not homogeneous. Certainly, in such a case, the computed value for the combined effect size is 
little more than an average of a group of heterogeneous scores, and not reflective of a consistent 
effect size across studies. In view of this, in order for the combined effect size value computed 
with Equation 28.90 to be meaningful, it should have been previously demonstrated that the 
effect sizes for the k studies are relatively homogeneous. As noted earlier, the latter is 
questionable in our example, in spite of the fact that the result obtained with Equation 28.89 did 
not achieve statistical significance. 

In the final analysis, one should view combined/pooled probability and effect size values 
computed in meta-analysis as rough estimates. These values are subject to change as more data 
on a hypothesis become available. Certainly, if at a given point in time the available data reflect 
what is true in the underlying populations, the values generated in a meta-analysis will be 
reasonably accurate and not change substantially after additional data become available. 


Practical implications of magnitude of effect size value Before closing this section, a com- 
ment is in order concerning the relationship between the magnitude of an effect size and its 
practical implications regarding the relationship that exists between the independent and depen- 
dent variable. Rosnow and Rosenthal (1989) provide an interesting example (based on a study 
by the Steering Committee of the Physicians Health Study Research Group (1988)) involving the 
use of the phi coefficient. These authors illustrate that a low correlation coefficient (in this case 
the correlation is computed with the phi coefficient) need not necessarily indicate that the rela- 
tionship between two variables is trivial. Table 28.11 summarizes the results of the study, which 
evaluated the effect of aspirin versus a placebo on heart attacks. 

In order to determine whether the result of the study is statistically significant, we evaluate 
the data with the chi-square test for r x c tables. Employing Equation 16.2, the computed 
chi-square value for Table 28.11 is χ² = 25.01. Since r = 2 and c = 2, the degrees of freedom are 
computed to be df = (2 - 1)(2 - 1) = 1. Employing Table A4, we determine that the tabled 
critical .05 and .01 chi-square values for df = 1 are χ²_.05 = 3.84 and χ²_.01 = 6.63. Since the 
computed value χ² = 25.01 is greater than both of the aforementioned critical values, the null 
hypothesis can be rejected at both the .05 and .01 levels. The actual probability value associated 
with the result is less than .001. 


Table 28.11 Summary of Data for Heart Attack Study 

                              Y variable 
X variable        Heart attack = 0   No heart attack = 1   Row sums 
Aspirin = 0             104               10,933            11,037 
Placebo = 1             189               10,845            11,034 
Column sums             293               21,778            22,071 

Now let us compute the magnitude of effect size for the study. Both Equations 16.18 and 
28.1 (as well as Equation 28.83) can be employed to compute the phi coefficient (which as noted 
earlier in this section can be viewed as an r value within the context of meta-analysis). 
Employing Equation 16.18, the value φ = √(χ²/n) = √(25.01/22071) = .034 is computed. As 
noted earlier, Equation 28.1 can also be employed to compute the value of phi (as is done with 
the data in Table 28.3). Thus, we designate the independent variable (which is comprised of the 
following two levels: aspirin versus placebo) as the X variable. The value 0 will be assigned as 
the X score for any subject who received aspirin, and the value 1 will be assigned as the X score 
for any subject who received the placebo. The dependent variable, which is whether or not a 
subject had a heart attack, will be the Y variable. The value 0 will be assigned as the Y score for 
any subject who had a heart attack, and the value 1 will be assigned as the Y score for any subject 
who did not have a heart attack. Employing the same protocol that is used to analyze the data 
in Table 28.3, we compute the following values for Table 28.11: ΣX = ΣX² = 11034; 
ΣY = ΣY² = 21778; ΣXY = 10845; n = 22071. When the aforementioned values are 
substituted in Equation 28.1, the value r = -.034 is computed, which is the same absolute value 
that is computed with Equation 16.18 for the phi coefficient (which can only be a positive 
value). 
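
The chi-square value, the phi coefficient, and the equivalent Pearson r for Table 28.11 can be verified with the sketch below. It uses the standard shortcut formula for a 2 x 2 table (no continuity correction) rather than Equation 16.2 itself, and the usual computational formula for the Pearson correlation applied to the 0/1 codes described above; the variable names are illustrative.

```python
from math import sqrt

# Cell frequencies from Table 28.11.
a, b = 104, 10933   # aspirin (X = 0): heart attack (Y = 0), no heart attack (Y = 1)
c, d = 189, 10845   # placebo (X = 1): heart attack (Y = 0), no heart attack (Y = 1)
n = a + b + c + d

# Chi-square for a 2 x 2 table via the shortcut formula.
chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Equation 16.18 / 28.83: phi is the square root of chi-square over n.
phi = sqrt(chi_square / n)

# Pearson r computed from the 0/1 codes. Because the scores are 0 or 1,
# the sum of X equals the sum of X squared (and likewise for Y).
sum_x = c + d    # number of subjects with X = 1 (placebo)
sum_y = b + d    # number of subjects with Y = 1 (no heart attack)
sum_xy = d       # number of subjects with X = 1 and Y = 1
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x - sum_x ** 2) * (n * sum_y - sum_y ** 2)
)

print(f"chi-square = {chi_square:.2f}, phi = {phi:.3f}, r = {r:.3f}")
```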

The square of φ = .034 (r = -.034) is φ² = r² = .001156. Earlier in the book it was 
noted that the latter value, which represents the coefficient of determination, is commonly 
employed as a measure of effect size for a 2 x 2 contingency table. The value r² = .001156 
indicates that .1156% of the variability on the dependent variable can be accounted for on the 
basis of variability on the independent variable. Since .1156% is such a small value, one might 
get the impression that there is little if any relationship between taking aspirin and having a heart 
attack. Given the fact that the sample is comprised of 22,071 observations, it is not at all 
surprising that a probability value less than .001 is obtained when the chi-square test for r x c tables 
is employed to analyze the data. A sample size of this magnitude ensures that the power of the 
chi-square test will be large enough to identify as statistically significant even a minimum effect 
size. Yet when Equation 16.24 is employed with the same data to compute the odds ratio (dis- 
cussed in Section VI of the chi-square test for r x c tables), we obtain the value o = 1.83: 
o = [(10933)(189)]/[(104)(10845)] = 1.83. The value o = 1.83 indicates that the odds of a person 
who received the placebo having a heart attack are 1.83 times larger than the odds of a person 
who received aspirin having a heart attack. If we employ relative risk as a measure (also 
discussed in Section VI of the chi-square test for r x c tables), through use of Equation 16.22, 
we can determine that the relative risk of having a heart attack if one received the placebo as 
opposed to getting an aspirin is [(189)/(11034)]/[(104)/(11037)] = 1.82. The latter value 
indicates that someone taking the placebo is 1.82 times more likely to have a heart attack than 
someone taking aspirin. The values computed for the odds ratio and relative risk clearly 
indicate that there is a definite advantage to a subject taking aspirin. In point of fact, since the 
researchers came to the latter conclusion, they terminated the study while it was still in progress, 
deeming it unethical to deprive the control subjects (i.e., the placebo group) of aspirin. 

What can we conclude from the above example? As Cohen (1977, 1988) and Rosenthal 
(1991, 1993) note, there are circumstances when the strength of a relationship between an inde- 
pendent and dependent variable will be greater than the magnitude suggested by the coefficient 
of determination. A small r? value does not necessarily mean the relationship between the 
experimental variables is trivial, and is of no practical consequence. In the final analysis, 
depending upon how one expresses effect size, it is conceivable that a researcher may draw 
different conclusions regarding the strength of the relationship between the variables under study. 

In closing this discussion, it should be emphasized that there are procedures and issues 
related to meta-analysis that have not been covered in this book. For a more comprehensive 
discussion of meta-analysis, the reader should consult Hedges and Olkin (1985), Mullen and 
Rosenthal (1985), Rosenthal (1991, 1993), and Wolf (1986). Hunt (1997) provides a good 
nonmathematical summary of the history and application of meta-analysis in the scientific 
community. 


The significance test controversy As noted throughout this book, the traditional hypothesis 
testing model employs a null hypothesis and an alternative hypothesis. The null hypothesis is 
a statement of zero difference, which is commensurate with saying that in the underlying popu- 
lation(s) there is zero effect/correlation present. During the past 25 years an increasing number 
of researchers and statisticians have become critical of the conventional hypothesis testing model. 
Among those who have addressed this issue are Cohen (1994), Harlow et al. (1997), Meehl 
(1978), Morrison and Henkel (1970), Murphy and Myors (1998), and Serlin and Lapsley (1985, 
1993). 

The crux of the argument against the traditional hypothesis testing model is that, in reality, 
the null hypothesis is always false. Specifically, various sources note that the null hypothesis is 
a point hypothesis, in that it stipulates a precise value — namely zero — for the difference 
between the experimental conditions. Thus, any difference, no matter how negligible, will 
provide sufficient grounds for rejecting the null hypothesis. It has been pointed out by numerous 
researchers that the actual difference between two experimental conditions is probably never 
exactly equal to zero. Although admittedly a difference may be close to zero, if our measuring 
instrument is sufficiently sensitive and we carry our measurements out to many decimal places, 
we will probably never record a difference that is exactly equal to zero. And if the latter is true, 
it means that the null hypothesis will always be false. 

If, in fact, the null hypothesis is always false, it logically follows that it is not possible to 
commit a Type I error (which is rejecting a true null hypothesis). If a Type I error becomes 
impossible, then the only type of error a researcher need concern herself with is a Type II error 
(which is not rejecting a false null hypothesis). It was noted earlier in the book, that the likelihood 
of committing a Type II error is inversely related to the power of a statistical test — i.e., the more 
powerful the test, the lower the likelihood of committing a Type II error. If a researcher wishes 
to achieve a specific level of power for a statistical test, prior to conducting the test one must 
stipulate the magnitude of effect size one is trying to detect. The smaller the effect size, the 
greater the power that will be required for the test. There are essentially three ways a researcher 
can increase the power of a statistical test. They are: a) Reduction of error variability; b) 
Increasing the value of alpha (i.e., the significance level) employed in determining statistical 
significance; and c) Increasing the size of the sample. If a researcher assumes she is unable to 
reduce error variability any more than she already has, in order to increase power she must employ 
a higher alpha level and/or increase the size of the sample employed in a study. Since current scientific 
convention does not endorse the use of an alpha level larger than .05, at this point in time it 
probably isn't reasonable to expect that higher alpha levels will be an acceptable mechanism for 
increasing power. Thus, the most practical and effective way to maximize power is by increasing 
sample size. How large a sample one employs should be dictated by the magnitude of effect size 
one is trying to detect (which obviously involves a subjective decision on the part of a 
researcher). Within the context of the traditional hypothesis testing model, it will require a 
relatively large sample size to declare a result significant if the effect size present is small or even 
in the low medium range. As Murphy and Myors (1998) note, the outcome of a test of statistical 
significance employed within the traditional model provides more of a commentary on the power 
of the statistical test than it does on the strength of the relationship (i.e., effect size) between the 
variables under study. 

In the final analysis, since the null hypothesis is always wrong, Murphy and Myors (1998) 
state that a researcher can never design a study that has too much power. They recommend that 
in reference to the effect size one is trying to detect, whenever possible the power of a test should 
be at least .50, and ideally .80 or greater. If the power of a study is so low that there is little 
likelihood of detecting the hypothesized effect size, the study probably isn't worth conducting. 
Consequently, prior to conducting a study a researcher should determine (based on previous 
research or theoretical conjecture) what she considers to be a meaningful effect size, and employ 
a large enough sample to insure a reasonable likelihood of detecting it (if, in fact, it is present). 
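
The dependence of power on sample size and effect size can be illustrated with a short calculation. The sketch below is an assumption-laden approximation, not a procedure from this book: it treats a two-group comparison as a two-tailed z test on Cohen's d index with equal group sizes, and it ignores the negligible chance of rejecting in the wrong tail. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed, two-group z test for effect size d."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(d * sqrt(n_per_group / 2) - z_crit)


def n_per_group_for_power(d, power=0.80, alpha=0.05):
    """Smallest per-group n that reaches the requested power under the same approximation."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / d) ** 2)


# Cohen's conventional small, medium, and large values of d.
for d in (0.2, 0.5, 0.8):
    n = n_per_group_for_power(d)
    print(f"d = {d}: n per group = {n}, power = {approx_power(d, n):.3f}")
```

Under these assumptions a small effect (d = .2) requires several hundred subjects per group to reach a power of .80, whereas a large effect (d = .8) requires only a few dozen.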

The minimum-effect hypothesis testing model An alternative that has been suggested 
to the traditional hypothesis testing model is the minimum-effect hypothesis testing model, 
which is described in detail by Murphy and Myors (1998). Based on papers by, among others, 
Meehl (1978) and Serlin and Lapsley (1985, 1993), the model employs the null hypothesis to 
stipulate a value below which any effect present in the data would be viewed as trivial, and above 
which would be meaningful. As an example, if one were comparing the IQ scores of two groups, 
the null hypothesis might stipulate a difference between 0 and 5 points, while the alternative 
hypothesis would stipulate a difference greater than five points. In such a case, any difference 
of five points or less would result in retaining the null hypothesis, since a difference within that 
range would be considered trivial (i.e., of no practical or theoretical value). A difference of more 
than five points would lead to rejection of the null hypothesis, since a difference 
greater than five points would be considered meaningful. Note that the null hypothesis in the 
minimum-effect model stipulates a range of values, whereas in the traditional hypothesis testing 
model the null hypothesis stipulates a point (i.e., a specific value). 
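
The contrast between a point null hypothesis and a range null hypothesis can be sketched numerically. The example below is only an illustration under stated assumptions: it uses a one-tailed normal (z) approximation rather than the noncentral F tables discussed later in this section, it evaluates the range null hypothesis at its upper boundary, and the observed difference and standard error are hypothetical values.

```python
from statistics import NormalDist


def one_tailed_p(observed_diff, standard_error, null_upper_limit=0.0):
    """One-tailed p value for H0: true difference <= null_upper_limit (normal approximation)."""
    z = (observed_diff - null_upper_limit) / standard_error
    return 1 - NormalDist().cdf(z)


observed_diff = 8.0    # hypothetical difference in mean IQ between two groups
standard_error = 2.0   # hypothetical standard error of that difference

p_traditional = one_tailed_p(observed_diff, standard_error, 0.0)     # H0: zero difference
p_minimum_effect = one_tailed_p(observed_diff, standard_error, 5.0)  # H0: 0 to 5 points

print(f"traditional p = {p_traditional:.5f}, minimum-effect p = {p_minimum_effect:.4f}")
```

With these hypothetical values the traditional null hypothesis is rejected at the .05 level while the minimum-effect null hypothesis is retained, which illustrates why the latter model has lower power for a given set of data.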

Murphy and Myors (1998) discuss stipulating differences within the minimum-effect 
hypothesis testing model in terms of effect size. They suggest the null hypothesis could stipulate 
a range of values in which the effects of the treatment account for what is considered to be a 
negligible/trivial amount of variability. They suggest one might stipulate in the null hypothesis 
that between 0 and 1% of variance on the dependent variable can be accounted for by variation 
on the independent variable. The alternative hypothesis would be supported if an effect size of 
greater than 1% is detected. The value of 1% corresponds to the lower limit of Cohen's (1977, 
1988) minimum value for a small effect size noted earlier in this section, when an r value is 
employed to measure effect size (i.e., r² = (.1)² = .01, which, expressed as a percentage, is 1%). 
Murphy and Myors (1998) have developed special tables based on the noncentral F 
distribution (alluded to previously in Section VI of the single-factor between-subjects analysis 
of variance under the discussion of power) for evaluating a minimum-effect null hypothesis. In 
contrast to the central F distribution, which has been used throughout the book for evaluating 
F values within the context of the traditional hypothesis testing model, the noncentral F dis- 
tribution can be employed to evaluate a minimum-effect null hypothesis. The tables developed 
by Murphy and Myors (1998) allow for testing significance at the .05 and .01 levels, and also 
provide for an analysis of power. In addition to evaluating a minimum-effect null hypothesis that 
stipulates an effect size of 1% or less as trivial, the tables also evaluate a null hypothesis that 
stipulates an effect size of 5% or less as trivial (since Murphy and Myors (1998) believe some 
researchers might prefer to employ the latter value as the upper limit for a trivial effect). Murphy 
and Myors (1998) also provide equations for converting various test statistics (e.g., χ², R², 
etc.) into F values, in order that they can be employed with the tables for the noncentral F 
distribution. It should be noted that for a given set of data, the minimum-effect hypothesis testing 
model will have lower power than the traditional hypothesis testing model. This is the case, since 
it is easier to reject a null hypothesis stipulating a zero difference than it is to reject a null 
hypothesis that stipulates a range of values between zero and some number above it. Murphy 
and Myors (1998) note that the loss of power is offset by the fact that the minimum-effect model 
provides the researcher with more meaningful results. 

The minimum-effect hypothesis testing model is more compatible with the hypothesis 
testing philosophies of the major proponents of meta-analysis (e.g., Rosenthal and Hedges) than 
is the traditional hypothesis testing model. In both the minimum-effect model and meta-analysis, 
greater emphasis is placed on effect size than on the level of significance. In the final analysis, 
regardless of which hypothesis testing model one employs, the key to 
effective hypothesis testing ultimately boils down to employing a representative sample that is 
large enough to detect any meaningful effect(s) present in the underlying population(s). 
Research that abides by the latter principle will ultimately yield results that are both reliable and 
meaningful. 
Endnotes 


1. It is also possible to designate the Y variable as the predictor variable and the X variable as 
the criterion variable. The use of the Y variable as the predictor variable is discussed in 
Section VI. 


2. It should be noted that when the joint distribution of two variables is bivariate normal, only 
a linear relationship can exist between the variables. As a result of the latter, whenever the 
population correlation between two bivariate normally distributed variables equals zero, one 
can conclude that the variables are statistically independent of one another. Under such 
conditions the null hypothesis H₀: ρ = 0 stated in this section is equivalent to the null 
hypothesis that the two variables are independent of one another. On the other hand, it is 
possible for each of two variables to be normally distributed, yet the joint distribution of 
the two variables not be bivariate normal. When the latter is true, it is possible to compute 
the value r = 0, and at the same time have two variables that are statistically dependent upon 
one another. Statistical dependence in such a case will be the result of the fact that the 
variables are curvilinearly related to one another. 


3. Howell (1992, 1997) notes that the value of r computed with Equation 28.1 is a biased 
estimate of the underlying population parameter ρ. The degree to which the computed 
value of r is biased is inversely related to the size of the sample employed in computing the 
correlation coefficient. For this reason, when one employs correlational data within the 
framework of research, it is always recommended that a reasonably large sample size be 
employed. 

One way of correcting for bias resulting from a small sample size is to employ 
the equation noted below to compute the value r̃, which represents a relatively unbiased 
estimate of the population parameter ρ. The value r̃ is referred to as a “shrunken” or 
“adjusted” estimate of the population correlation. The computation of r̃ (the absolute value 
of which will always be less than the absolute value of r) is demonstrated for Example 28.1. 

\[ \tilde{r} = \sqrt{1 - \frac{(1 - r^2)(n - 1)}{n - 2}} = \sqrt{1 - \frac{[1 - (.955)^2](5 - 1)}{5 - 2}} = .940 \]

Thus, in the case of Example 28.1, r̃ = .940 provides a better estimate of the true 
population correlation than r = .955 (although even if r̃ is employed, n = 5 is an absurdly 
low sample size to employ within the framework of serious research). Since most sources 
use the computed value of r rather than r̃, the former value will be employed as the 
estimate of the population correlation throughout the discussion of the Pearson product- 
moment correlation coefficient. 
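
The shrinkage adjustment in endnote 3 can be verified with a few lines of code. This is a minimal sketch that assumes the commonly cited adjustment r̃ = sqrt(1 - (1 - r²)(n - 1)/(n - 2)), which reproduces the value .940 reported for Example 28.1; the function name is hypothetical.

```python
from math import sqrt


def adjusted_r(r, n):
    """Shrunken (adjusted) estimate of the population correlation for a sample of size n.

    The sign of r is preserved. A negative value under the radical (possible
    for a weak correlation in a tiny sample) is treated as an estimate of zero.
    """
    radicand = 1 - (1 - r ** 2) * (n - 1) / (n - 2)
    if radicand <= 0:
        return 0.0
    return sqrt(radicand) if r >= 0 else -sqrt(radicand)


print(f"{adjusted_r(0.955, 5):.3f}")
```

As the sample size grows, the ratio (n - 1)/(n - 2) approaches 1 and the adjusted value converges on r.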


4. An alternative form of Equation 28.3 that is based on the relationship t² = F (described 
in Section VII of the single-factor between-subjects analysis of variance), which yields 
equivalent results, employs the F distribution. The equation employing the F distribution 
for evaluating the significance of r is noted below. 

\[ F = \frac{r^2 (n - 2)}{1 - r^2} \]

Employing the above equation with Example 28.1, the value F = 31.10 is computed. 

\[ F = \frac{(.955)^2 (5 - 2)}{1 - (.955)^2} = 31.10 \]

The computed value F = 31.10 (which is equivalent to (t = 5.58)² if rounding off 
error is ignored) is evaluated with Table A10. The degrees of freedom employed in 
evaluating the above equation are df_num = 1, df_den = n - 2. Thus, df_num = 1 and df_den = 3. 
It is determined that the tabled critical .05 and .01 two-tailed values are F.05 = 10.13 and 
F.01 = 34.12, and the tabled critical .05 and .01 one-tailed values are F.05 = 5.54 and 
F.01 = 20.61. (The latter values, which are not in Table A10, were obtained by squaring 
the tabled critical one-tailed values t.05 = 2.35 and t.01 = 4.54. A full discussion of 
one-tailed F values can be found in Section VI of the t test for two independent samples under 
the discussion of homogeneity of variance.) The same guidelines for interpreting a 
computed t value with Equation 28.3 are employed to interpret the computed F value, with one 
exception in reference to the directional alternative hypothesis H₁: ρ < 0. Since the 
value of F will always be a positive number, if the directional alternative hypothesis H₁: ρ < 0 is 
employed, in order to reject the null hypothesis the value of F must be equal to or greater 
than the tabled critical one-tailed F value at the prespecified level of significance. 
However, the sign of r must be negative. When the F distribution is employed to evaluate 
the null hypothesis H₀: ρ = 0, it results in identical conclusions to those reached when 
Equation 28.3 is employed. Specifically, the nondirectional alternative hypothesis 
H₁: ρ ≠ 0 is supported at the .05 level, since F = 31.10 is greater than the tabled critical 
two-tailed value F.05 = 10.13. The directional alternative hypothesis H₁: ρ > 0 is 
supported at both the .05 and .01 levels, since r is a positive number, and F = 31.10 is 
greater than the tabled critical one-tailed values F.05 = 5.54 and F.01 = 20.61. 

Marascuilo and Serlin (1988) note that the following equation employing the normal 
distribution can also be used with large sample sizes to evaluate the null hypothesis 
H₀: ρ = 0: z = r√(n - 1). If applied to Example 28.1, the value z = (.955)√(5 - 1) 
= 1.91 is computed. The value z = 1.91 only supports the directional alternative hypothesis 
H₁: ρ > 0 at the .05 level, since z = 1.91 is greater than the tabled critical one-tailed value 
z.05 = 1.65 in Table A1 in the Appendix. The nondirectional alternative hypothesis 
H₁: ρ ≠ 0 is not supported, since z = 1.91 is less than the tabled critical two-tailed 
value z.05 = 1.96. The latter result indicates that when employed with a small sample size, 
the normal approximation provides a more conservative test of an alternative hypothesis 
than does Equation 28.3. 
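
The three test statistics discussed in endnote 4 can be reproduced for Example 28.1 as follows. This is a sketch only; it assumes that Equation 28.3 is the usual t test for a correlation, t = r√(n - 2)/√(1 - r²), and the function names are hypothetical.

```python
from math import sqrt


def t_for_r(r, n):
    """t statistic for H0: rho = 0, with df = n - 2."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


def f_for_r(r, n):
    """Equivalent F statistic, with df = 1 and n - 2 (F equals t squared)."""
    return r ** 2 * (n - 2) / (1 - r ** 2)


def z_for_r(r, n):
    """Large-sample normal approximation attributed to Marascuilo and Serlin (1988)."""
    return r * sqrt(n - 1)


r, n = 0.955, 5  # values from Example 28.1
print(f"t = {t_for_r(r, n):.2f}, F = {f_for_r(r, n):.2f}, z = {z_for_r(r, n):.2f}")
```

The values of t and F always lead to the same decision, whereas z is noticeably smaller than t for a sample this small.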


5. a) The value (1 - r²) is often referred to as the coefficient of nondetermination, since 
it represents the proportion of variance that the two variables do not hold in common with 
one another. Further discussion of the coefficient of determination can be found in 
Section IX (the Addendum) in the discussion of meta-analysis and related topics; b) 
Ozer (1985) argues that under certain conditions it is more prudent to employ |r| as a 
measure of the proportion of variance on one variable that can be accounted for by 
variability on the other variable. Cohen (1988, p. 533) succinctly summarizes Ozer's 
(1985) point by noting that when there is reason to believe that a causal relationship exists 
between X and Y, the value r² provides an appropriate estimate of the 
percentage/proportion of variance on Y attributable to X. However, if there is reason to 
believe that both X and Y are caused by a third variable, the absolute value of r is a more 
appropriate measure to employ to represent the proportion of shared variance between X 
and Y. 


6. The reader should keep in mind that for illustrative purposes the sample size employed for 
Example 28.1 is very small. Consequently, the values r and r² are, in all likelihood, not 
accurate estimates of the corresponding underlying population parameters ρ and ρ². 


7. An equation that is based on the minimum squared distance of all the points from the line 
reflects the fact that if the distance of each data point from the line is measured, and the 
resulting value is squared, the sum of the squared values for the n data points is the lowest 
possible value that can be obtained for that set of data. 


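8. The values s_Y.X and s_X.Y can also be computed with the equations noted below: 

\[ s_{Y.X} = \sqrt{\frac{SS_Y - \frac{(SP_{XY})^2}{SS_X}}{n - 2}} = \sqrt{\frac{29.2 - \frac{(88.6)^2}{294.8}}{5 - 2}} = .92 \]

\[ s_{X.Y} = \sqrt{\frac{SS_X - \frac{(SP_{XY})^2}{SS_Y}}{n - 2}} = \sqrt{\frac{294.8 - \frac{(88.6)^2}{29.2}}{5 - 2}} = 2.94 \]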


9. The reader may find it useful to review the discussion of confidence intervals in Section VI 
of the single-sample t test before reading this section. 


10. The term SS_X in Equation 28.14 may also be written in the form SS_X = (n - 1)s̃²_X, and 
the term SS_Y in Equation 28.15 may also be written in the form SS_Y = (n - 1)s̃²_Y. 
11. Zar (1999, p. 382) notes that the following equation can be employed to convert a z_r 
value into an r value: 

\[ r = \frac{e^{2 z_r} - 1}{e^{2 z_r} + 1} \]

Thus, if z_r = 1.886, r = [(2.71828)^(2)(1.886) - 1]/[(2.71828)^(2)(1.886) + 1] = .955. 

The equation r = tanh z_r can also be employed to convert a z_r value into an r value (where 
tanh represents the hyperbolic tangent of a number). Thus, tanh 1.886 = .955. Scientific 
calculators generally have keys that allow for quick computation of a tanh value. 


12. Equation 28.20 can also be written in the form: z = (z_r - z_ρ)√(n - 3). Thus, 
z = (1.886 - 1.099)√(5 - 3) = 1.11. 


13. The value n = 5 employed in Example 28.1 is not used, since the method to be described 
is recommended when n ≥ 25. For smaller sample sizes, tables in Cohen (1977, 1988) 
derived by David (1938) can be employed. 


14. For the analysis described in this section the df = ∞ curve is employed for the relevant set 
of power curves, since Fisher's z_r transformation is based on the normal distribution. 


15. Equation 28.24 can also be employed to evaluate the hypothesis of whether there is a 
significant difference between k = 2 independent correlations (i.e., the same hypothesis 
evaluated with Equation 28.22). When k = 2, the result obtained with Equation 28.24 will 
be equivalent to the result obtained with Equation 28.22. Specifically, the square of the 
value of z obtained with Equation 28.22 will equal the value of χ² obtained with 
Equation 28.24. Thus, if the data employed in Equation 28.22 are employed in Equation 
28.24, the obtained value of chi-square equals χ² = z² = (.878)² = .771. 

\[ \chi^2 = \left[(5 - 3)(1.886)^2 + (5 - 3)(1.008)^2\right] - \frac{\left[(5 - 3)(1.886) + (5 - 3)(1.008)\right]^2}{(5 - 3) + (5 - 3)} = .771 \]


16. When k = 2, Equation 28.25 is equivalent to Equation 28.23. 


17. If homogeneity of variance is assumed for the two samples, a pooled error variance can be 
computed as follows: 

\[ s^2_{Y.X} = \frac{(n_1 - 2) s^2_{Y.X_1} + (n_2 - 2) s^2_{Y.X_2}}{n_1 + n_2 - 4} \]

\[ s^2_{X.Y} = \frac{(n_1 - 2) s^2_{X.Y_1} + (n_2 - 2) s^2_{X.Y_2}}{n_1 + n_2 - 4} \]

The computed value s²_Y.X is used in place of both s²_Y.X₁ and s²_Y.X₂ in Equation 28.31, 
and the computed value s²_X.Y is used in place of both s²_X.Y₁ and s²_X.Y₂ in Equation 28.32. 


18. Marascuilo and Serlin (1988) describe how the procedure described in this section can 
be extended to the evaluation of a hypothesis contrasting three or more regression 
coefficients. 
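
The arithmetic in endnotes 11, 12, and 15 can be checked with the short sketch below. It assumes the standard forms of the equations involved (Fisher's transformation inverted with the hyperbolic tangent, and the usual z and chi-square tests on transformed correlations); the function names are hypothetical, and the value 1.099 is the Fisher transformed value assumed for the hypothesized population correlation in endnote 12.

```python
from math import sqrt, tanh


def r_from_fisher_z(z_r):
    """Convert a Fisher transformed value back into a correlation (r = tanh z_r)."""
    return tanh(z_r)


def z_single(z_r, z_rho, n):
    """Test of one transformed correlation against a hypothesized value."""
    return (z_r - z_rho) * sqrt(n - 3)


def z_two_independent(z_r1, n1, z_r2, n2):
    """Test of the difference between two independent transformed correlations."""
    return (z_r1 - z_r2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))


def chi_square_homogeneity(z_values, n_values):
    """Chi-square test of homogeneity for k independent transformed correlations."""
    weights = [n - 3 for n in n_values]
    weighted_squares = sum(w * z ** 2 for w, z in zip(weights, z_values))
    weighted_sum = sum(w * z for w, z in zip(weights, z_values))
    return weighted_squares - weighted_sum ** 2 / sum(weights)


print(f"r = {r_from_fisher_z(1.886):.3f}")
print(f"z = {z_single(1.886, 1.099, 5):.2f}")
z = z_two_independent(1.886, 5, 1.008, 5)
chi2 = chi_square_homogeneity([1.886, 1.008], [5, 5])
print(f"z squared = {z ** 2:.3f}, chi-square = {chi2:.3f}")
```

For k = 2 the chi-square value equals the square of the z value, as stated in endnote 15.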
19. The equations for z_X and z_Y are analogous to Equation I.27 in the Introduction (which 
employs the population parameters μ and σ in computing a z score). 


20. The sum of products within this context is not the same as the sum of products that 
represents the numerator of Equation 28.1. 


21. One way of avoiding the problem of dependent pairs is to form pairs in which no digit is 
used for more than one pair. In other words, the first two digits in the series represent the 
X and Y variables for the first pair, the third and fourth digits in the series represent the X and 
Y variables for the second pair, and so on. Although use of the latter methodology really does 
not conform to the definition of autocorrelation, if it is employed one can justify employing 
the critical values in Table A16. 


22. A discussion of the derivation of Equation 28.41 can be found in Bennett and Franklin 
(1954). 


23. The reader should take note of the fact that the data for Example 28.5 are fictitious and, 
in reality, the result of the analysis in this section may not be consistent with actual studies 
that have been conducted which evaluate the relationship between intelligence and 
eye-hand coordination. 


24. Although the phi coefficient is described in the book as a measure of association for the 
chi-square test for r x c tables (specifically, for 2 x 2 tables), it is also employed in 
psychological testing as a measure of association for 2 x 2 tables in order to evaluate the 
consistency of n subjects’ responses to two questions. The latter type of analysis is 
essentially a dependent samples analysis for a 2 x 2 table, which, in fact, is the general 
model for which the McNemar test (Test 20) is employed. 


25. It is also the case that the greater the number of predictor variables in a set of data involving 
a fixed number of subjects, the larger the value of R². 


26. This principle has obviously not been adhered to in the example under discussion in order 
to minimize computations. 


27. Tabachnick and Fidell (1989, 1996) note that, for small sample sizes, some sources 
recommend an even more severe adjustment than that which results from using Equation 28.56. 


28. The equation noted below is equivalent to Equation 28.62. 

[Equation illegible in the source; the only surviving fragment is the term n - k - 1 = 5 - 2 - 1.] 


29. The following should be noted with respect to Equation 28.69: a) When the sample size is 
small and/or the number of subjects is not substantially larger than the number of predictor 
variables, the “shrunken” estimate R̃² (computed with Equation 28.56) should be employed 
in Equation 28.69; b) When there are more than two predictor variables, the multiple 
correlation coefficient for the k variables is employed in the numerator of the radical of 
Equation 28.69 in place of R²_Y.X₁X₂. The value r²_X₁X₂ in the denominator of the radical of 
Equation 28.69 is replaced by the squared multiple correlation coefficient of variable i with 
all of the remaining predictor variables. Thus, if there are three predictor variables and s_b₁ 
is computed, the values employed in the numerator and denominator of the radical are, 
respectively, R²_Y.X₁X₂X₃ and R²_X₁.X₂X₃. 


30. Howell (1992, 1997) cites sources who argue that the t distribution does not provide a 
precise approximation of the underlying sampling distribution for the standard error of 
estimate of the coefficients. On the basis of this he states that caution should be employed 
in interpreting the results of the t test. 


31. The same results are obtained if the analysis is done employing the standardized regression 
coefficients. This is demonstrated below employing the appropriate equations for the 
standardized coefficients. The minimal discrepancy between the values t_β and t_b is due to 
rounding off error. 

[Equations illegible in the source.] 


32. Marascuilo and Levin (1983) and Marascuilo and Serlin (1988) recommend that in order 
to control the Type I error rate, a more conservative t value should be employed when the 
number of regression coefficients evaluated is greater than one. These sources describe the 
use of the Bonferroni-Dunn and Scheffé procedures (which are described in reference 
to multiple comparisons for analysis of variance procedures) in adjusting the t value. 


33. The computed correlation coefficient can also be evaluated through use of the critical values 
in Table A16 (for df = n - v). 


34. Note that when a simple bivariate/zero order correlation is computed, n - v = n - 2, and 
thus Equation 28.73 becomes identical to Equation 28.3 (which is used to evaluate the 
significance of the zero order correlation coefficient r_XY). 


35. The computed correlation coefficient can also be evaluated through use of the critical values 
in Table A16 (for df = n - v). 


36. As is the case for Equation 28.73, Equation 28.75 becomes identical to Equation 28.3 when 
n - v = n - 2. 


37. Cohen (1977, 1988) also discusses additional effect size indices that are employed for 
computing the power of various multivariate procedures. 


38. a) The values that Cohen (1977; 1988, pp. 24-27) employs for identifying a small versus 
medium versus large effect size for the d index and other indices to be described in this 
section were developed in reference to behavioral science research. Although these values 
can be employed for research in areas other than the behavioral sciences, it is conceivable 
that practitioners in other disciplines may elect to employ different values which they deem 
more appropriate for their area of specialization; b) Equation 28.77 can also be employed 
to convert an omega squared value (ω̂²) (discussed in Section VI of the t test for two 
independent samples) into a d value. If the values .0099, .0588, and .1379 (which are 
Cohen's lower limits for omega squared for a small, medium, and large effect size) are 
employed to represent r² in Equation 28.77, they yield the following corresponding d 
values: .2, .5, and .8; c) Further clarification of the relationship between r and Cohen's d 
index can be found in Cohen (1977; 1988, pp. 81-83). 


39. Although Cohen's d index is also used as a measure of effect size within the framework 
of meta-analysis, Rosenthal (1991, pp. 17-18) argues that the use of an r value is preferable 
to the use of a d value as a measure of effect size. 


40. The author is indebted to Robert Rosenthal for clarifying some of the issues discussed in 
this section. 


41. It is important to emphasize that researchers are often not in agreement with regard to the 
most appropriate estimate of effect size to employ. Hopefully, if an effect of some 
magnitude is present which has theoretical or practical implications, regardless of which measure 
of effect size one employs, a reasonably accurate estimate of the effect size will emerge 
from an analysis. 


42. a) The same test result will be obtained if the z value for Study E is assigned a positive sign, 
and the z values for Studies A, B, C, and D are assigned negative signs; b) If in a given 
study the means of the two groups are equal, the p value for that study will equal .50, and 
the corresponding z value will equal 0. 


43. Rosenthal (1991) presents a modified form of Equation 28.86 that allows a researcher to 
differentially weight the k studies employed in a meta-analysis. A weighting system is 
employed within this context to reflect the relative quality of each of the studies. The 
magnitude of the weights (which are assigned by a panel of judges) is supposed to be a direct 
function of the quality of a study. 


44. It is interesting to note that the average z value for the five studies is z̄ = .95, and that 
the latter z value in itself is not statistically significant. It is quite common for Equation 
28.86 to yield a significant combined p value when the average z value in itself is not 
statistically significant. 


45. The same test result will be obtained if, instead, we assign a negative sign to the Fisher 
transformed z_r values for Studies A, B, C, and D, and a positive sign for the Fisher 
transformed z_r value for Study E. 


46. As noted under the discussion of the odds ratio in Section VI of the chi-square test for 
r x c tables, the values of the odds ratio and the relative risk will be very close together 
when the event in question (in this case a heart attack) has a low probability of occurring. 
The likelihood of someone in the placebo group having a heart attack is 189/11034 
= .01713, while the likelihood of someone in the aspirin group having a heart attack is 
104/11037 = .00942 (note that .01713/.00942 = 1.82, which is the value of relative risk). 
Thus, the values computed for the odds ratio and the relative risk are almost identical. 


47. An alternative way of presenting effect size for a contingency table which (like the odds 
ratio) may make it more apparent if a seemingly small effect is of practical consequence 
is the binomial effect size display (BESD) developed by Rosenthal and Rubin (1982). 


48. Wolf (1986) describes a nonparametric measure of effect size that can be employed in 
meta-analysis. 
Test 29 


Spearman's Rank-Order Correlation Coefficient 
(Nonparametric Measure of Association/Correlation 
Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Spearman's rank-order correlation coefficient is one of a number of measures of correlation 
or association discussed in this book. Measures of correlation are not inferential statistical tests, 
but are, instead, descriptive statistical measures that represent the degree of relationship between 
two or more variables. Upon computing a measure of correlation, it is common practice to 
employ one or more inferential statistical tests in order to evaluate one or more hypotheses con- 
cerning the correlation coefficient. The hypothesis stated below is the most commonly evaluated 
hypothesis for Spearman's rank-order correlation coefficient. 


Hypothesis evaluated with test In the underlying population represented by a sample, is the 
correlation between subjects’ scores on two variables some value other than zero? The latter 
hypothesis can also be stated in the following form: In the underlying population represented 
by the sample, is there a significant monotonic relationship between the two variables? It is 
important to note that the nature of the relationship described by Spearman's rank-order 
correlation coefficient is based on an analysis of two sets of ranks. 


Relevant background information on test Prior to reading the material in this section the 
reader should review the general discussion of correlation in Section I of the Pearson product- 
moment correlation coefficient (Test 28). Developed by Spearman (1904), Spearman's rank- 
order correlation coefficient is a bivariate measure of correlation/association that is employed 
with rank-order data. The population parameter estimated by the correlation coefficient will be 
represented by the notation ρ_s (where ρ is the lower case Greek letter rho). The sample statistic 
computed to estimate the value of ρ_s will be represented by the notation r_s. In point of fact, 
Spearman's rank-order correlation coefficient is a special case of the Pearson product- 
moment correlation coefficient, when the latter measure is computed for two sets of ranks. The 
relationship between Spearman's rank-order correlation coefficient and the Pearson product- 
moment correlation coefficient is discussed in Section VI. 

As is the case for the Pearson product-moment correlation coefficient, Spearman's 
rank-order correlation coefficient can be employed to evaluate data for n subjects, each of 
whom has contributed a score on two variables (designated as the X and Y variables). Within 
each of the variables, the n scores are rank-ordered. Spearman's rank-order correlation 
coefficient is also commonly employed to evaluate the degree of agreement between the rankings 
of m = 2 judges for n subjects/objects. 

In computing Spearman's rank-order correlation coefficient, one of the following is true 
with regard to the rank-order data that are evaluated: a) The data for both variables are in a 
rank-order format, since it is the only format for which data are available; b) The original data are in 
a rank-order format for one variable and in an interval/ratio format for the second variable. In 
such an instance, data on the second variable are converted to a rank-order format in order that 
both sets of data represent the same level of measurement; and c) The data for both variables 
have been transformed into a rank-order format from an interval/ratio format, since the researcher 
has reason to believe that one or more of the assumptions underlying the Pearson product- 
moment correlation coefficient (which is the analogous parametric correlational procedure 
employed for interval/ratio data) have been saliently violated. It should be noted that since 
information is sacrificed when interval/ratio data are transformed into a rank-order format, some 
researchers may elect to employ the Pearson product-moment correlation coefficient rather 
than Spearman's rank-order correlation coefficient, even when there is reason to believe that 
one or more of the assumptions of the former measure have been violated. 

Spearman's rank-order correlation coefficient determines the degree to which a mono- 
tonic relationship exists between two variables. A monotonic relationship can be described as 
monotonic increasing (which is associated with a positive correlation) or monotonic decreasing 
(which is associated with a negative correlation). A relationship between two variables is 
monotonic increasing, if an increase in the value of one variable is always accompanied by an 
increase in the value of the other variable. A relationship between two variables is monotonic 
decreasing, if an increase in the value of one variable is always accompanied by a decrease in 
the value of the other variable. Based on the above definitions, a positively sloped straight line 
represents an example of a monotonic increasing function, while a negatively sloped straight line 
represents an example of a monotonic decreasing function. In addition to the aforementioned 
linear functions, curvilinear functions can also be monotonic. For instance, the function Y = X² 
depicted in Figure 29.1 represents an example of a monotonic increasing function, since an 
increase in the X variable always results in an increase in the Y variable. It should be noted that 
when the interval/ratio scores on two variables are monotonically related to one another, a linear 
function can be employed to describe the relationship between the rank-orderings of the two 
variables. This latter fact is demonstrated in Section VI. 


[Figure 29.1 Monotonic Increasing Relationship (Y = X²). Horizontal axis: X; vertical axis: Y.] 
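
The point made above (that a monotonic relationship between interval/ratio scores implies a linear relationship between the rank-orderings) can be illustrated with a brief Python sketch. The X scores below are hypothetical nonnegative values (they are not drawn from any example in the book), and the helper function ranks is an illustrative name that assumes no tied scores are present.

```python
def ranks(values):
    """Rank distinct scores from 1 (lowest) to n (highest); assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

x = [0, 1, 2, 3, 4, 5]        # hypothetical nonnegative X scores
y = [v ** 2 for v in x]       # Y = X squared, a monotonic increasing curve for X >= 0

rank_x = ranks(x)
rank_y = ranks(y)

# Although Y is a curvilinear function of X, the two sets of ranks are identical,
# so the ranks fall on a positively sloped straight line.
same_order = rank_x == rank_y
print(rank_x, rank_y, same_order)
```

Since the two sets of ranks coincide, a plot of the ranks on Y against the ranks on X falls on a straight line even though the plot of the original scores is curved.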


The same general guidelines that are described for interpreting the value of the Pearson 
product-moment correlation coefficient can be applied to Spearman's rank-order correlation 
coefficient. Thus, the range of values r_s can assume is defined by the limits -1 to +1 (i.e., 
-1 ≤ r_s ≤ +1). The absolute value of r_s (i.e., |r_s|) indicates the strength of the relationship 
between the two variables. As the absolute value of r_s approaches 1, the strength of the 
monotonic relationship increases, being the strongest when r_s equals either +1 or -1. The closer 
the absolute value of r_s is to 0, the weaker the monotonic relationship between the two variables, 
and, when r_s = 0, no monotonic relationship is present. The sign of r_s indicates 
the direction of the monotonic relationship (i.e., positive/increasing monotonic versus negative/ 
decreasing monotonic). As is the case for the Pearson product-moment correlation 
coefficient, a positive correlation indicates that an increase (decrease) on one variable is 
associated with an increase (decrease) on the other variable. A negative correlation indicates that 
an increase (decrease) on one variable is associated with a decrease (increase) on the other 
variable. 

It is important to note that correlation does not imply causation. Consequently, if there is 
a strong correlation between two variables (i.e., the absolute value of r_s is close to 1), a researcher 
is not justified in concluding that one variable causes the other variable. Although it is 
possible that when a strong correlation exists one variable may, in fact, cause the other variable, 
the information employed in computing Spearman’s rank-order correlation coefficient does 
not allow a researcher to draw such a conclusion. This is the case, since extraneous variables that 
have not been taken into account by the researcher can be responsible for the observed 
correlation between the two variables. 


II. Example 


Example 29.1 is identical to Example 28.1 (which is evaluated with the Pearson product- 
moment correlation coefficient). In evaluating Example 29.1 it will be assumed that the ratio 
data are rank-ordered, since one or more of the assumptions of the Pearson product-moment 
correlation coefficient have been saliently violated.¹ 


Example 29.1 A psychologist conducts a study employing a sample of five children to 
determine whether there is a statistical relationship between the number of ounces of sugar a ten- 
year-old child eats per week (which will represent the X variable) and the number of cavities in 
a child's mouth (which will represent the Y variable). The two scores (ounces of sugar consumed 
per week and number of cavities) obtained for each of the five children follow: Child 1 (20, 7); 
Child 2 (0, 0); Child 3 (1, 2); Child 4 (12, 5); Child 5 (3, 3). Is there a significant correlation 
between sugar consumption and the number of cavities? 


III. Null versus Alternative Hypotheses 


Upon computing Spearman's rank-order correlation coefficient, it is common practice to 
determine whether the obtained absolute value of the correlation coefficient is large enough to 
allow a researcher to conclude that the underlying population correlation coefficient between the 
two variables is some value other than zero. Section V describes how the latter hypothesis, 
which is stated below, can be evaluated through use of tables of critical r_s values or through use 
of an inferential statistical test that is based on either the t or z distributions. 


Null hypothesis H_0: ρ_s = 0 


(In the underlying population the sample represents, the correlation between the ranks of subjects 
on Variable X and Variable Y equals 0.) 


Alternative hypothesis H_1: ρ_s ≠ 0 


(In the underlying population the sample represents, the correlation between the ranks of subjects 
on Variable X and Variable Y equals some value other than 0. This is a nondirectional 
alternative hypothesis, and it is evaluated with a two-tailed test. Either a significant positive 
r_s value or a significant negative r_s value will provide support for this alternative hypothesis. 
In order to be significant, the obtained absolute value of r_s must be equal to or greater than the 
tabled critical two-tailed r_s value at the prespecified level of significance.) 


or 
H_1: ρ_s > 0 


(In the underlying population the sample represents, the correlation between the ranks of sub- 
jects on Variable X and Variable Y equals some value greater than 0. This is a directional 
alternative hypothesis, and it is evaluated with a one-tailed test. Only a significant positive 
r_s value will provide support for this alternative hypothesis. In order to be significant (in 
addition to the requirement of a positive r_s value), the obtained absolute value of r_s must be 
equal to or greater than the tabled critical one-tailed r_s value at the prespecified level of 
significance.) 


or 
H_1: ρ_s < 0 


(In the underlying population the sample represents, the correlation between the ranks of subjects 
on Variable X and Variable Y equals some value less than 0. This is a directional alternative 
hypothesis, and it is evaluated with a one-tailed test. Only a significant negative r_s value 
will provide support for this alternative hypothesis. In order to be significant (in addition to the 
requirement of a negative r_s value), the obtained absolute value of r_s must be equal to or 
greater than the tabled critical one-tailed r_s value at the prespecified level of significance.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected.² 


IV. Test Computations 


Table 29.1 summarizes the data for Example 29.1. The following should be noted with respect 
to Table 29.1: a) The number of subjects is n = 5. Each subject has an X score and a Y score, and 
thus there are five X scores and five Y scores; b) The rankings of the five subjects' scores on the 
X and Y variables are respectively recorded in the columns labelled R_X and R_Y; c) The column 
labelled d = R_X - R_Y contains a difference score for each subject, which is obtained by subtracting 
a subject's rank on the Y variable from the subject's rank on the X variable; and d) The 
column labelled d² contains the square of each subject's difference score. 


Table 29.1 Summary of Data for Example 29.1 


Subject    X     R_X    Y     R_Y    d = R_X - R_Y    d² 
1          20    5      7     5      0                0 
2          0     1      0     1      0                0 
3          1     2      2     2      0                0 
4          12    4      5     4      0                0 
5          3     3      3     3      0                0 
                                     Σd = 0           Σd² = 0 

The ranking protocol employed in Table 29.1 is identical to that employed for the Mann-Whitney 
U test (Test 12). Whereas in the case of the latter test the scores of subjects are ranked 
within each group, in the computation of Spearman's rho the scores of the n = 5 subjects are 
ranked within each of the variables. Thus, in Table 29.1 the five subjects’ X scores are ranked 
such that a rank of 1 is assigned to the lowest score on the X variable, a rank of 2 is assigned to 
the next lowest score on the X variable, and so on until a rank of 5 is assigned to the highest score 
on the X variable. The identical ranking procedure is employed with respect to the Y scores (i.e., 
a rank of 1 is assigned to the lowest score on the Y variable, a rank of 2 is assigned to the next 
lowest score on the Y variable, and so on until a rank of 5 is assigned to the highest score on the 
Y variable). In the event of tied scores (which do not occur in Example 29.1), as is the case for 
other rank-order procedures, the average of the ranks involved is assigned to all scores tied for 
a given rank. 
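
The ranking protocol described above (including the averaging of tied ranks) can be sketched in Python as follows. The function name average_ranks is hypothetical and is employed only for purposes of illustration. The first two lists contain the X and Y scores of Example 29.1, and the third list contains the X scores that are employed later in Table 29.2, in which ties are present.

```python
def average_ranks(scores):
    """Rank scores from 1 (lowest) upward; tied scores receive the mean of the ranks they occupy."""
    ordered = sorted(scores)
    rank_of = {}
    for value in set(ordered):
        first = ordered.index(value) + 1           # lowest rank position occupied by this value
        count = ordered.count(value)               # number of scores tied at this value
        rank_of[value] = first + (count - 1) / 2   # mean of ranks first .. first + count - 1
    return [rank_of[s] for s in scores]

sugar = [20, 0, 1, 12, 3]      # X scores, Example 29.1
cavities = [7, 0, 2, 5, 3]     # Y scores, Example 29.1
tied_x = [0, 0, 2, 4, 8, 8, 8, 13, 16, 16]   # X scores, Table 29.2

print(average_ranks(sugar))
print(average_ranks(cavities))
print(average_ranks(tied_x))
```

For the data of Example 29.1 the sketch reproduces the ranks recorded in Table 29.1, and for the tied X scores it reproduces the averaged ranks (e.g., 1.5 for the two scores of 0) recorded in Table 29.2.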

It should be noted that it is permissible to reverse the ranking protocol described above. 
Specifically, for each variable a rank of 1 can be assigned to the highest score on that variable 
and a rank of 5 to the lowest score on that variable. Employing this alternative ranking protocol 
will yield the identical value for r_s as the one yielded by the ranking protocol employed in Table 
29.1. It should be emphasized that regardless of which ranking protocol is employed, the same 
protocol must be employed for both variables. The protocol of assigning the lowest rank to the 
lowest score and the highest rank to the highest score is employed in Example 29.1, since it 
allows for easiest interpretation of the results of the study. 

In Column 6 of Table 29.1, the sum of the difference scores is computed to be Σd = 0. In 
point of fact, Σd will always equal zero, and if Σd is some value other than zero, it indicates that an 
error has been made in the rankings and/or computations. In the last column of Table 29.1, the 
sum of the squared difference scores (Σd² = 0) is computed. This latter value (which will 
only equal zero when r_s = 1) and the value of n are employed in Equation 29.1, which is the 
equation for computing Spearman's rank-order correlation coefficient.³ 


\[ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} \qquad \text{(Equation 29.1)} \]


Substituting the appropriate values in Equation 29.1, the value r_s = 1 is computed. 


\[ r_s = 1 - \frac{(6)(0)}{5[(5)^2 - 1]} = 1 \]
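
The computation of Equation 29.1 can be sketched in Python as follows. The function name spearman_rho is hypothetical, and the function assumes that the scores have already been converted to ranks. The first call employs the ranks in Table 29.1, and the second call employs the ranks that appear later in Table 29.2 (for which the uncorrected value .764 is reported in Section VI).

```python
def spearman_rho(rank_x, rank_y):
    """Equation 29.1: r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), with no tie correction."""
    n = len(rank_x)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

# Ranks from Table 29.1 (Example 29.1)
r_s_example = spearman_rho([5, 1, 2, 4, 3], [5, 1, 2, 4, 3])

# Ranks from Table 29.2 (ties present, no correction applied)
r_s_tied = spearman_rho(
    [1.5, 1.5, 3, 4, 6, 6, 6, 8, 9.5, 9.5],
    [2, 1, 3.5, 3.5, 9.5, 9.5, 5, 6, 7, 8],
)
print(r_s_example, round(r_s_tied, 3))
```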


V. Interpretation of the Test Results 


The obtained value r_s = 1 is evaluated with Table A18 (Table of Critical Values for 
Spearman's Rho) in the Appendix. The critical values in Table A18 are listed in reference to 
n.⁴ Employing Table A18, it can be determined that the tabled critical two-tailed r_s value at the 
.05 level of significance is 1. Because of the small sample size, it is not possible to 
evaluate the nondirectional alternative hypothesis at the .01 level. The tabled critical one-tailed r_s 
values at the .05 and .01 levels of significance are .90 and 1, respectively. 

The following guidelines are employed in evaluating the null hypothesis H_0: ρ_s = 0. 

a) If the nondirectional alternative hypothesis H_1: ρ_s ≠ 0 is employed, the null hypothesis 
can be rejected if the obtained absolute value of r_s is equal to or greater than the tabled critical 
two-tailed value at the prespecified level of significance. 

b) If the directional alternative hypothesis H_1: ρ_s > 0 is employed, the null hypothesis can 
be rejected if the sign of r_s is positive, and the value of r_s is equal to or greater than the tabled 
critical one-tailed value at the prespecified level of significance. 

c) If the directional alternative hypothesis H_1: ρ_s < 0 is employed, the null hypothesis 
can be rejected if the sign of r_s is negative, and the absolute value of r_s is equal to or greater 
than the tabled critical one-tailed value at the prespecified level of significance. 

Employing the above guidelines, the nondirectional alternative hypothesis H_1: ρ_s ≠ 0 is 
supported at the .05 level, since the computed value r_s = 1 is equal to the tabled critical 
two-tailed value of 1. The directional alternative hypothesis H_1: ρ_s > 0 is supported at both 
the .05 and .01 levels, since the computed value r_s = 1 is a positive number that is equal to or 
greater than the tabled critical one-tailed values .90 and 1. The directional 
alternative hypothesis H_1: ρ_s < 0 is not supported, since the computed value r_s = 1 is a 
positive number. 
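
The three guidelines can be expressed as a short Python sketch. The function name null_rejected and the labels employed for the three alternative hypotheses are hypothetical, and the critical values passed to the function are the tabled values for n = 5 that are cited in the text (they are not computed by the code).

```python
def null_rejected(r_s, critical_value, alternative):
    """Apply guidelines a) through c); alternative is 'two-sided', 'greater', or 'less'."""
    if alternative == "two-sided":        # H_1: rho_s != 0
        return abs(r_s) >= critical_value
    if alternative == "greater":          # H_1: rho_s > 0
        return r_s > 0 and r_s >= critical_value
    if alternative == "less":             # H_1: rho_s < 0
        return r_s < 0 and abs(r_s) >= critical_value
    raise ValueError("unknown alternative: " + alternative)

r_s = 1.0  # value computed for Example 29.1

two_tailed_05 = null_rejected(r_s, 1.0, "two-sided")   # tabled two-tailed .05 value is 1
one_tailed_05 = null_rejected(r_s, 0.90, "greater")    # tabled one-tailed .05 value is .90
one_tailed_01 = null_rejected(r_s, 1.0, "greater")     # tabled one-tailed .01 value is 1
wrong_direction = null_rejected(r_s, 0.90, "less")     # r_s is positive, so not supported
print(two_tailed_05, one_tailed_05, one_tailed_01, wrong_direction)
```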

When the Pearson product-moment correlation coefficient is employed to evaluate the 
same set of data (i.e., the ratio scores of subjects are correlated with one another), the nondirectional 
alternative hypothesis (i.e., H_1: ρ ≠ 0) is also supported at only the .05 level, and the 
directional alternative hypothesis (i.e., H_1: ρ > 0) is supported at both the .05 and .01 levels. 
Thus, in this instance, the two correlation coefficients yield comparable results. (However, since 
Pearson r is the more powerful of the two correlational procedures, it is more likely to result in 
rejection of the null hypothesis at a given level of significance when applied to the same set of 
data.) 


Test 29a: Test of significance for Spearman's rank-order correlation coefficient In the 
event a researcher does not have access to Table A18, Equation 29.2, which employs the t 
distribution, provides an alternative way of evaluating the null hypothesis H_0: ρ_s = 0. Most 
sources that recommend Equation 29.2 state that it provides a reasonably good approximation 
of the underlying sampling distribution when n ≥ 10. 


\[ t = \frac{r_s\sqrt{n - 2}}{\sqrt{1 - r_s^2}} \qquad \text{(Equation 29.2)} \]


The t value computed with Equation 29.2 is evaluated with Table A2 (Table of Student's 
t Distribution) in the Appendix. The degrees of freedom employed are df = n - 2. Thus, in the 
case of Example 29.1, df = 5 - 2 = 3. For df = 3, the tabled critical two-tailed .05 and .01 values 
are t_.05 = 3.18 and t_.01 = 5.84, and the tabled critical one-tailed .05 and .01 values are 
t_.05 = 2.35 and t_.01 = 4.54. Since the sign of the t value computed with Equation 29.2 will 
always be the same as the sign of r_s, the guidelines described earlier in reference to Table 
A18 for evaluating an r_s value can also be applied in evaluating the t value computed with 
Equation 29.2 (i.e., substitute t in place of r_s in the text of the guidelines for evaluating r_s). 

Inspection of Equation 29.2 reveals that if the absolute value of r_s equals 1, the term 
√(1 - r_s²) will equal zero, thus rendering the equation insoluble (i.e., t = [(1)√(5 - 2)]/√(1 - (1)²) 
requires division by zero). Consequently, Equation 29.2 cannot be applied to Example 29.1. 
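
A minimal Python sketch of Equation 29.2 follows. The function name spearman_t is hypothetical, and the decision to return None when the absolute value of r_s equals 1 is simply one way of flagging the insoluble case noted above. The second call employs the uncorrected value r_s = .764 and n = 10 that are reported for Table 29.2 in Section VI.

```python
import math

def spearman_t(r_s, n):
    """Equation 29.2: t = r_s * sqrt(n - 2) / sqrt(1 - r_s^2), with df = n - 2.

    Returns None when |r_s| = 1, since the denominator equals zero.
    """
    if abs(r_s) >= 1:
        return None
    return r_s * math.sqrt(n - 2) / math.sqrt(1 - r_s ** 2)

t_example = spearman_t(1.0, 5)      # Example 29.1: equation cannot be evaluated
t_tied = spearman_t(0.764, 10)      # Table 29.2 data, df = 10 - 2 = 8
print(t_example, t_tied)
```

The resulting t value would be evaluated with Table A2 in the same manner described in the text.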

Equation 29.3, which employs the normal distribution, is an alternative equation for 
evaluating the significance of r_s. When the sample size is large (approximately 200 or greater), 
Equation 29.3 will yield a result that is equivalent to that obtained with Equation 29.2.⁵ 


\[ z = r_s\sqrt{n - 1} \qquad \text{(Equation 29.3)} \]


Although the sample size in Example 29.1 is well below the minimum size recommended 
for Equation 29.3, the appropriate values will be substituted in the latter equation in order to 
demonstrate its application. Substituting the values r_s = 1 and n = 5 in Equation 29.3, the value 
z = 2.00 is computed. 


\[ z = (1)\sqrt{5 - 1} = 2.00 \]
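
The normal approximation of Equation 29.3 can be sketched in Python as follows. The function name spearman_z is hypothetical. The critical values employed in the comparisons are the tabled two-tailed values of the normal distribution (1.96 and 2.58) cited in the text.

```python
import math

def spearman_z(r_s, n):
    """Equation 29.3: z = r_s * sqrt(n - 1)."""
    return r_s * math.sqrt(n - 1)

z = spearman_z(1.0, 5)                    # Example 29.1
significant_05 = abs(z) >= 1.96           # two-tailed test, .05 level
significant_01 = abs(z) >= 2.58           # two-tailed test, .01 level
print(z, significant_05, significant_01)
```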


The computed value z = 2.00 is evaluated with Table A1 (Table of the Normal 
Distribution) in the Appendix. In the latter table, the tabled critical two-tailed .05 and .01 
values are z_.05 = 1.96 and z_.01 = 2.58, and the tabled critical one-tailed .05 and .01 values are 
z_.05 = 1.65 and z_.01 = 2.33. Since the sign of the z value computed with Equation 29.3 will 
always be the same as the sign of r_s, the guidelines described earlier in reference to Table A18 
for evaluating an r_s value can also be applied in evaluating the z value computed with Equation 
29.3 (i.e., substitute z in place of r_s in the text of the guidelines for evaluating r_s). 

Employing the guidelines, the nondirectional alternative hypothesis H_1: ρ_s ≠ 0 is 
supported at the .05 level, since the computed value z = 2.00 is greater than the tabled critical 
two-tailed value z_.05 = 1.96. It is not, however, supported at the .01 level, since z = 2.00 is less 
than the tabled critical two-tailed value z_.01 = 2.58. 

The directional alternative hypothesis H_1: ρ_s > 0 is supported at the .05 level, since 
the computed value z = 2.00 is a positive number that is greater than the tabled critical one-tailed 
value z_.05 = 1.65. It is not, however, supported at the .01 level, since z = 2.00 is less than the 
tabled critical one-tailed value z_.01 = 2.33. 

The directional alternative hypothesis H_1: ρ_s < 0 is not supported, since the computed 
value z = 2.00 is a positive number. In order for the alternative hypothesis H_1: ρ_s < 0 to be 
supported, the computed value of z must be a negative number (as well as the fact that the 
absolute value of z must be equal to or greater than the tabled critical one-tailed value at the 
prespecified level of significance). Note that the results obtained through use of Equation 29.3 
are reasonably consistent with those that are obtained when Table A18 is employed.⁶ 

A summary of the analysis of Example 29.1 follows: It can be concluded that there is a 
significant monotonic increasing/positive relationship between the number of ounces of sugar 
a ten-year-old child eats and the number of cavities in a child's mouth. This result can be 
summarized as follows (if it is assumed the nondirectional alternative hypothesis H_1: ρ_s ≠ 0 
is employed): r_s = 1, p < .05. 


VI. Additional Analytical Procedures for Spearman's Rank-Order 
Correlation Coefficient and/or Related Tests 


1. Tie correction for Spearman's rank-order correlation coefficient When one or more ties 
are present in a set of data, many sources recommend that the r_s value computed with Equation 
29.1 be adjusted. The reason for this is that when ties are present, Equation 29.1 spuriously 
inflates the absolute value of r_s. In practice, most of the time that ties are present the effect on 
the value of r_s will be minimal (unless the number of ties is excessive). The tie correction procedure 
to be demonstrated in this section will employ the data summarized in Table 29.2. 
Assume that the data are for the same variables evaluated in Example 29.1, except for the fact 
that a different set of subjects is employed with n = 10. 

Employing Equation 29.1, it is determined that the value of Spearman's rho without 
employing a tie correction is r_s = .764. 


\[ r_s = 1 - \frac{(6)(39)}{10[(10)^2 - 1]} = .764 \]


Table 29.2 Data Employed with Tie Correction Procedure 


Subject    X     R_X    Y    R_Y    d = R_X - R_Y    d² 
1          0     1.5    1    2      -.5              .25 
2          0     1.5    0    1      .5               .25 
3          2     3      2    3.5    -.5              .25 
4          4     4      2    3.5    .5               .25 
5          8     6      8    9.5    -3.5             12.25 
6          8     6      8    9.5    -3.5             12.25 
7          8     6      3    5      1                1 
8          13    8      4    6      2                4 
9          16    9.5    6    7      2.5              6.25 
10         16    9.5    7    8      1.5              2.25 

                                    Σd = 0           Σd² = 39 


The tie correction will now be introduced. In the example under discussion there are s = 
3 sets of ties involving the ranks of subjects’ X scores (Subjects 1 and 2; Subjects 5, 6, and 7; 
Subjects 9 and 10), and s = 2 sets of ties involving the ranks of subjects’ Y scores (Subjects 3 
and 4; Subjects 5 and 6). Equation 29.8 is employed to compute the tie-corrected Spearman's 
rank-order correlation coefficient, which will be represented by the notation r_sc. Note that the 
values Σx² and Σy² in Equation 29.8 are computed with Equations 29.6 and 29.7, and that 
Equations 29.6 and 29.7 are, respectively, based on the values T_x and T_y, which are computed 
with Equations 29.4 and 29.5. In Equation 29.4, t_i(x) represents the number of X scores that 
are tied for a given rank. In Equation 29.5, t_j(y) represents the number of Y scores that are tied 
for a given rank. The notations Σ(t_i(x)³ - t_i(x)) and Σ(t_j(y)³ - t_j(y)) indicate that the following 
is done with respect to each of the variables: a) For each set of ties, the number of ties in the set 
is subtracted from the cube of the number of ties in that set; and b) The sum of all the values 
computed in part a) is obtained for that variable. 

When the data from Table 29.2 are substituted in Equations 29.4-29.8, the tie-corrected 
value r_sc = .758 is computed. 


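\[ T_x = \sum_{i=1}^{s}\left(t_{i(x)}^3 - t_{i(x)}\right) = [(2)^3 - 2] + [(3)^3 - 3] + [(2)^3 - 2] = 36 \qquad \text{(Equation 29.4)} \]

\[ T_y = \sum_{j=1}^{s}\left(t_{j(y)}^3 - t_{j(y)}\right) = [(2)^3 - 2] + [(2)^3 - 2] = 12 \qquad \text{(Equation 29.5)} \]

\[ \sum x^2 = \frac{n^3 - n - T_x}{12} = \frac{(10)^3 - 10 - 36}{12} = 79.5 \qquad \text{(Equation 29.6)} \]

\[ \sum y^2 = \frac{n^3 - n - T_y}{12} = \frac{(10)^3 - 10 - 12}{12} = 81.5 \qquad \text{(Equation 29.7)} \]

\[ r_{sc} = \frac{\sum x^2 + \sum y^2 - \sum d^2}{2\sqrt{\sum x^2 \sum y^2}} = \frac{79.5 + 81.5 - 39}{2\sqrt{(79.5)(81.5)}} = .758 \qquad \text{(Equation 29.8)} \]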


Thus, by employing the tie correction, the value of rho is reduced from the uncorrected 
value of r_s = .764 to r_sc = .758. As noted earlier, the correction is minimal. 
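
The tie correction of Equations 29.4 through 29.8 can be sketched in Python as follows. The function names tie_term and tie_corrected_rho are hypothetical, and the code assumes that tied scores have already been assigned averaged ranks (so that a set of ties can be identified by a repeated rank value). The ranks are those recorded in Table 29.2.

```python
import math

def tie_term(ranks):
    """Equations 29.4/29.5: sum of (t^3 - t) over each set of t tied ranks."""
    counts = {}
    for r in ranks:
        counts[r] = counts.get(r, 0) + 1
    # An untied rank has t = 1 and contributes 1^3 - 1 = 0.
    return sum(t ** 3 - t for t in counts.values())

def tie_corrected_rho(rank_x, rank_y):
    """Equations 29.6 through 29.8 applied to two lists of (averaged) ranks."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    sum_x2 = (n ** 3 - n - tie_term(rank_x)) / 12      # Equation 29.6
    sum_y2 = (n ** 3 - n - tie_term(rank_y)) / 12      # Equation 29.7
    return (sum_x2 + sum_y2 - sum_d2) / (2 * math.sqrt(sum_x2 * sum_y2))  # Equation 29.8

rank_x = [1.5, 1.5, 3, 4, 6, 6, 6, 8, 9.5, 9.5]
rank_y = [2, 1, 3.5, 3.5, 9.5, 9.5, 5, 6, 7, 8]

t_x = tie_term(rank_x)
t_y = tie_term(rank_y)
r_sc = tie_corrected_rho(rank_x, rank_y)
print(t_x, t_y, round(r_sc, 3))
```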

2. Spearman's rank-order correlation coefficient as a special case of the Pearson product-moment 
correlation coefficient Although the procedure described in the previous section for 
dealing with ties is the one recommended in most sources, in actuality, an alternative and at times 
more computationally efficient procedure can be employed. In Section I it is noted that Spear- 
man's rank-order correlation coefficient is a special case of the Pearson product-moment 
correlation coefficient. In point of fact, if the Pearson product-moment correlation 
coefficient is computed for the rank-orders in a set of interval/ratio data, the computed r value 
will be identical to the value computed for r_s with Equation 29.1. This is demonstrated below 
for Example 29.1, where Equation 28.1 (the equation for computing the Pearson product-moment 
correlation coefficient) is employed to compute the value r = r_s = 1. Table 29.3 
summarizes the values that are substituted in Equation 28.1. Note that the ranks R_X and R_Y 
employed in Table 29.1 are used in Table 29.3 to represent the scores on the X and Y variables.⁷ 


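\[ r = \frac{55 - \frac{(15)(15)}{5}}{\sqrt{\left[55 - \frac{(15)^2}{5}\right]\left[55 - \frac{(15)^2}{5}\right]}} = \frac{10}{\sqrt{(10)(10)}} = 1 \]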


Table 29.3 Summary of Data for Example 29.1 for Evaluation with Equation 28.1 


Subject    X     X²    Y     Y²    XY 
1          5     25    5     25    25 
2          1     1     1     1     1 
3          2     4     2     4     4 
4          4     16    4     16    16 
5          3     9     3     9     9 

ΣX = 15    ΣX² = 55    ΣY = 15    ΣY² = 55    ΣXY = 55 


Table 29.4 Summary of Data in Table 29.2 for Evaluation with Equation 28.1 


Subject    X      X²       Y      Y²       XY 
1          1.5    2.25     2      4        3 
2          1.5    2.25     1      1        1.5 
3          3      9        3.5    12.25    10.5 
4          4      16       3.5    12.25    14 
5          6      36       9.5    90.25    57 
6          6      36       9.5    90.25    57 
7          6      36       5      25       30 
8          8      64       6      36       48 
9          9.5    90.25    7      49       66.5 
10         9.5    90.25    8      64       76 

ΣX = 55    ΣX² = 382    ΣY = 55    ΣY² = 384    ΣXY = 363.5 


When there are no ties present in the data, Equations 29.1 and 28.1 will always yield the 
identical value for r_s. However, anytime there is at least one set of ties, the values yielded by 
the two equations will not be identical. In point of fact, Howell (1992, 1997) notes that when ties 
are present in the data, the r_s value computed with Equation 28.1 will be equivalent to the tie-corrected 
value r_sc computed with Equation 29.8. When there are no ties present in the data, it 
is clearly more efficient to employ Equation 29.1 than it is to employ Equation 28.1. However, 
when ties are present, it can be argued that use of Equation 28.1 is more computationally efficient 
than Equation 29.8. To demonstrate the equivalency of Equation 28.1 and Equation 29.8, 
Equation 28.1 is employed below with the rank-orders in Table 29.2. Table 29.4 summarizes the 
values that are substituted in Equation 28.1. The value r = .758 obtained with Equation 28.1 is 
identical to the value r_sc = .758 obtained with Equation 29.8. 





\[ r = \frac{363.5 - \frac{(55)(55)}{10}}{\sqrt{\left[382 - \frac{(55)^2}{10}\right]\left[384 - \frac{(55)^2}{10}\right]}} = .758 \]
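
The equivalence described above can be checked with a brief Python sketch, in which the computational form of the Pearson product-moment correlation coefficient is applied directly to the ranks. The function name pearson_r is hypothetical. The first call employs the ranks in Table 29.3, and the second call employs the averaged ranks in Table 29.4.

```python
import math

def pearson_r(x, y):
    """Computational formula for Pearson r, applied here to two lists of ranks."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - (sum_x * sum_y) / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

# Ranks from Table 29.3 (no ties): the result equals r_s from Equation 29.1.
r_no_ties = pearson_r([5, 1, 2, 4, 3], [5, 1, 2, 4, 3])

# Averaged ranks from Table 29.4 (ties present): the result equals the tie-corrected value.
r_ties = pearson_r(
    [1.5, 1.5, 3, 4, 6, 6, 6, 8, 9.5, 9.5],
    [2, 1, 3.5, 3.5, 9.5, 9.5, 5, 6, 7, 8],
)
print(r_no_ties, round(r_ties, 3))
```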


3. Regression analysis and Spearman’s rank-order correlation coefficient When Spear- 
man’s rank-order correlation coefficient is computed for a set of data, a researcher may also 
want to derive the mathematical function that best allows one to predict a subject’s score on one 
variable through use of the subject's score on the second variable. To do this requires the use 
of regression analysis which, as noted in Section VI of the Pearson product-moment 
correlation coefficient, is a general term that describes statistical procedures which determine 
the mathematical function that best describes the relationship between two or more variables. 
One type of regression analysis that falls within the general category of nonparametric 
regression analysis is referred to as monotonic regression analysis. The latter type of analysis 
is based on the fact that if two variables (which are represented by interval/ratio data) are 
monotonically related to one another, the rankings on the variables will be linearly related to one 
another. This can be illustrated in reference to Example 29.1 through use of Figure 29.2. 
Whereas Figure 28.1 (in Section VI of the Pearson product-moment correlation coefficient) 
represents a scatterplot of the five pairs of ratio scores for Examples 28.1/29.1, Figure 29.2 is a 
scatterplot of the five pairs of ranks on the two variables. Note that the scatterplot is such that 
one can draw a positively sloped straight line which passes through all of the data points. The 
only time all of the data points will fall on the regression line is when the absolute value of the 
correlation between the two variables equals 1. Although some data points may fall on the line 
when an imperfect monotonic relationship is present, the others will not. The stronger the 
monotonic relationship, the closer the proximity of the data points to the line. 

As is noted in Section VI of the Pearson product-moment correlation coefficient, the 
most commonly employed method of regression analysis is the method of least squares (which 
is a linear regression procedure that derives the straight line which provides the best fit for a set 
of data). Although visual inspection of Figures 28.1 and 29.2 suggests a strong monotonic in- 
creasing relationship between the two variables (i.e., an increase in the number of ounces of sugar 
consumed is associated with an increase in the number of cavities), it does not allow one to pre- 
cisely determine whether the function that best describes the relationship is a straight line or a 
monotonic curve. In order to determine the latter, it is necessary to contrast the predictive accur- 
acy of the method of least squares with some alternative form of regression analysis. Conover 
(1980, 1999), who provides a bibliography on the general subject of monotonic regression 
analysis, describes its application in deriving a curve for a set of rank-ordered data. Marascuilo 
and McSweeney (1977) and Sprent (1989, 1993) also discuss the subject of monotonic regression 
analysis. In addition to sources on nonparametric statistics that discuss monotonic regression, 
many books on correlation and regression describe procedures for deriving different types of 
curvilinear functions. Daniel (1990) discusses a number of different approaches to nonparametric 
regression analysis, which derive the straight line that best describes the relationship between the 
interval/ratio scores on the two variables. These latter types of regression analysis (which employ 
the median instead of the mean as a reference point) are recommended when there is reason to 
believe that one or more of the assumptions underlying the method of least squares are saliently 
violated. Among those procedures Daniel (1990) describes are the Brown-Mood method 
(Brown and Mood (1951), and Mood (1950)), and a methodology developed by Theil (1950). 
Daniel (1990) also provides a comprehensive bibliography on the subject of nonparametric 
regression analysis. 

[Figure 29.2 Scatterplot of Ranks for Example 29.1. Horizontal axis: X = Ounces of sugar (ranks on X variable); vertical axis: Y = Number of cavities (ranks on Y variable).] 


4. Partial rank correlation The computation of a partial correlation coefficient, described 
in Section IX of the Pearson product-moment correlation coefficient, can be extended to 
Spearman's rank-order correlation coefficient. Thus, when the rank-orders for three variables 
are evaluated, Equation 28.72 can be employed to compute a partial correlation coefficient for 
Spearman’s rho (employing the relevant r_s values in the equation). Conover (1980, 1999) 
and Daniel (1990) discuss the computation of a partial correlation coefficient in reference to 
Spearman’s rho. 
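
A minimal Python sketch of a partial rank correlation follows. It assumes that the equation cited above has the standard first-order partial correlation form (the equation itself is not reproduced in this section, so this is an assumption), and it substitutes r_s values in place of Pearson r values. The function name partial_rank_correlation and the three r_s values are hypothetical.

```python
import math

def partial_rank_correlation(r_xy, r_xz, r_yz):
    """Standard first-order partial correlation of X and Y with Z held constant,
    computed here from three Spearman r_s values (assumed form, see lead-in)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical r_s values among three rank-ordered variables X, Y, and Z.
r_xy_given_z = partial_rank_correlation(r_xy=0.70, r_xz=0.50, r_yz=0.40)
print(round(r_xy_given_z, 3))
```

When Z is uncorrelated with both X and Y (i.e., both of the latter r_s values equal 0), the partial coefficient reduces to the original r_s value for X and Y.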


5. Use of Fisher's $z_r$ transformation with Spearman's rank-order correlation coefficient 
Zar (1999) notes that when n > 10 and $\rho_s$ < .9 (the value of which is estimated by $r_s$), the 
equations and procedures employing Fisher's $z_r$ transformation that are described in reference 
to the Pearson product-moment correlation coefficient can also be employed for Spearman's 
rho. The latter procedures involve testing various hypotheses about a correlation coefficient, 
computing confidence intervals, and computing power (all of which are described in Section VI 
of the Pearson product-moment correlation coefficient). Zar (1999) notes, however, that 
when the element 1/(n - 3) (where $\sqrt{1/(n - 3)}$ represents the standard error of Fisher's $z_r$) 
appears in an equation, it should be replaced by the value 1.060/(n - 3) when the computations 
are in reference to Spearman's rho (e.g., Equation 28.20 should be in the form 
$z = (z_r - z_{\rho_0})/\sqrt{1.060/(n - 3)}$ when evaluating the same hypothesis for Spearman's rho). 
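
The adjustment can be illustrated with a short Python sketch. It assumes the usual definition of Fisher's transformation, $z_r = \frac{1}{2}\ln[(1 + r)/(1 - r)]$, and applies the 1.060/(n - 3) variance noted above. The function names and the sample values ($r_s$ = .60, n = 20) are hypothetical and serve only as an illustration.

```python
import math

def fisher_z(r):
    """Fisher's z_r transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def spearman_fisher_z_test(r_s, n, rho_0=0.0):
    """z statistic for H0: rho_s = rho_0, using 1.060/(n - 3) as the variance."""
    standard_error = math.sqrt(1.060 / (n - 3))
    return (fisher_z(r_s) - fisher_z(rho_0)) / standard_error

def spearman_confidence_interval(r_s, n, z_critical=1.96):
    """Confidence interval for rho_s, back-transformed with tanh."""
    standard_error = math.sqrt(1.060 / (n - 3))
    center = fisher_z(r_s)
    lower = math.tanh(center - z_critical * standard_error)
    upper = math.tanh(center + z_critical * standard_error)
    return lower, upper

# Hypothetical sample: r_s = .60 computed on n = 20 pairs of ranks
z = spearman_fisher_z_test(0.60, 20)
ci_lower, ci_upper = spearman_confidence_interval(0.60, 20)
print(round(z, 2), round(ci_lower, 3), round(ci_upper, 3))
```

The only difference from the Pearson version of these procedures is the numerator 1.060 in the standard error, which makes the test slightly more conservative.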


VII. Additional Discussion of Spearman's Rank-Order Correlation 
Coefficient 


1. The relationship between Spearman's rank-order correlation coefficient, Kendall's 
coefficient of concordance, and the Friedman two-way analysis of variance by ranks 
Kendall's coefficient of concordance (Test 31), which is discussed later in the book, is a 
measure of association that allows a researcher to evaluate the degree of agreement between m 
sets of ranks on n subjects/objects. In point of fact, Kendall's coefficient of concordance is 
linearly related to Spearman's rank-order correlation coefficient.⁸ The underlying statistical 
model upon which Kendall's coefficient of concordance is based is identical to the model for 
the Friedman two-way analysis of variance by ranks (Test 25). As a result of this, the 
Friedman two-way analysis of variance by ranks can be employed to determine whether the 
value of the coefficient of concordance is significant. The Friedman two-way 
analysis of variance by ranks can also be used to determine whether the value of Spearman's 
rho is significant. This will be illustrated with Example 29.2, which represents a type of problem 
that is commonly evaluated with Spearman's rank-order correlation coefficient (as well as 
Kendall's coefficient of concordance when there are more than two sets of ranks). In Example 
29.2, n = 10 films (i.e., objects/subjects) are rank-ordered by m = 2 judges, and a determination 
is made with respect to the degree of agreement between the rankings of the judges. 


Example 29.2 In order to determine whether two critics agree with one another in their 
evaluation of movies, a newspaper editor asks the two critics to rank-order ten movies (assigning 
a rank of 1 to the best movie, a rank of 2 to the next best movie, etc.). Table 29.5 summarizes the 
data for the study. Is there a significant association between the two sets of ranks? 


Table 29.5 Summary of Data for Example 29.2 

            Critic 1    Critic 2 
Movie       $R_X$       $R_Y$       $d = R_X - R_Y$     $d^2$ 
   1          7          10              -3               9 
   2          1           2              -1               1 
   3          8           6               2               4 
   4         10           8               2               4 
   5          9           7               2               4 
   6          6           4               2               4 
   7          5           9              -4              16 
   8          2.5         3              -.5               .25 
   9          2.5         1              1.5              2.25 
  10          4           5              -1               1 
                                    $\sum d = 0$     $\sum d^2 = 45.5$ 

Note that in Table 29.5 each of the n = 10 rows represents one of the ten movies, instead 
of representing n = 10 subjects (as is the case in Example 29.1). The ranks of Critic 1 are 
represented in the column labelled $R_X$, and the ranks of Critic 2 are represented in the column 
labelled $R_Y$. Note that Critic 1 places Movies 8 and 9 in a tie for the second best movie. Thus 
(employing the protocol for tied ranks described in Section IV of the Mann-Whitney U test), 
the two ranks involved (2 and 3) are averaged ((2 + 3)/2 = 2.5), and each of the movies is 
assigned the average rank of 2.5. 

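Employing Equation 29.1, the value $r_s$ = .724 is computed. 

$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{(6)(45.5)}{(10)[(10)^2 - 1]} = .724$$

The tie-corrected value $r_{s_c}$ = .723 (for which the calculations are not shown) is almost identical. 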
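
The computation can be checked with a brief Python sketch. It assumes Equation 29.1 is the familiar form $r_s = 1 - 6\sum d^2/[n(n^2 - 1)]$, and it takes the rank Critic 2 assigns to Movie 8 to be 3, the value implied by the column sum of 5.5 listed for that movie in Table 29.6. For the tied data, the sketch also computes Pearson r directly on the ranks, which is one common way of handling ties. The text does not show its own tie-correction calculations, so treating the two as equivalent here is an assumption.

```python
import math

def spearman_rho(rank_x, rank_y):
    """Spearman's rho from the sum of squared rank differences (Equation 29.1)."""
    n = len(rank_x)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

def pearson_r(x, y):
    """Pearson r; applied to ranks it accommodates ties automatically."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sum_products / math.sqrt(ss_x * ss_y)

# Ranks for Movies 1 through 10 in Example 29.2
critic_1 = [7, 1, 8, 10, 9, 6, 5, 2.5, 2.5, 4]
critic_2 = [10, 2, 6, 8, 7, 4, 9, 3, 1, 5]

r_s = spearman_rho(critic_1, critic_2)
r_s_ties = pearson_r(critic_1, critic_2)
print(round(r_s, 3), round(r_s_ties, 3))
```

With only one pair of tied ranks, the two values differ in the third decimal place, which is consistent with the statement that the tie-corrected value is almost identical.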


Employing Table A18, it is determined that for n = 10, the tabled critical two-tailed .05 
and .01 values are $r_{s_{.05}}$ = .648 and $r_{s_{.01}}$ = .794, and the tabled critical one-tailed .05 and .01 
values are $r_{s_{.05}}$ = .564 and $r_{s_{.01}}$ = .745. Employing the aforementioned critical values, the 
nondirectional alternative hypothesis $H_1$: $\rho_s \neq 0$ and the directional alternative hypothesis 
$H_1$: $\rho_s > 0$ are supported at the .05 level, since the computed value $r_s$ = .724 is greater 
than the tabled critical two-tailed value $r_{s_{.05}}$ = .648 and the tabled critical one-tailed value 
$r_{s_{.05}}$ = .564. The alternative hypotheses are not supported at the .01 level, since $r_s$ = .724 is 
less than the tabled critical two-tailed value $r_{s_{.01}}$ = .794 and the tabled critical one-tailed value 
$r_{s_{.01}}$ = .745. 

If Equation 29.2 is employed to evaluate the null hypothesis $H_0$: $\rho_s = 0$, the value t = 2.97 
is computed. 

$$t = \frac{r_s\sqrt{n - 2}}{\sqrt{1 - r_s^2}} = \frac{(.724)\sqrt{10 - 2}}{\sqrt{1 - (.724)^2}} = 2.97$$

Employing Table A2, it is determined that for df = 10 - 2 = 8, the tabled critical two-tailed 
.05 and .01 values are $t_{.05}$ = 2.31 and $t_{.01}$ = 3.36, and the tabled critical one-tailed .05 and .01 
values are $t_{.05}$ = 1.86 and $t_{.01}$ = 2.90. Employing the aforementioned critical values, the 
nondirectional alternative hypothesis $H_1$: $\rho_s \neq 0$ is supported at the .05 level, since the 
computed value t = 2.97 is greater than the tabled critical two-tailed value $t_{.05}$ = 2.31. It is not 
supported at the .01 level, since t = 2.97 is less than $t_{.01}$ = 3.36. The directional alternative 
hypothesis $H_1$: $\rho_s > 0$ is supported at both the .05 and .01 levels, since the computed value t 
= 2.97 is a positive number (since $r_s$ = .724 is a positive number) that is greater than the tabled 
critical one-tailed values $t_{.05}$ = 1.86 and $t_{.01}$ = 2.90. 

If Equation 29.3 is employed to evaluate the null hypothesis $H_0$: $\rho_s = 0$, the value 
z = 2.17 is computed. 

$$z = r_s\sqrt{n - 1} = (.724)\sqrt{10 - 1} = 2.17$$

Employing Table A1, it is determined that the computed value z = 2.17 is greater than the 
tabled critical two-tailed value $z_{.05}$ = 1.96 and the tabled critical one-tailed value $z_{.05}$ = 1.65, 
but less than the tabled critical two-tailed value $z_{.01}$ = 2.58 and the tabled critical one-tailed 
value $z_{.01}$ = 2.33. Thus, both the nondirectional alternative hypothesis $H_1$: $\rho_s \neq 0$ and the 
directional alternative hypothesis $H_1$: $\rho_s > 0$ are supported at the .05 level, but not at the .01 
level. Note that identical conclusions are reached with Table A18 and Equation 29.3, but the 
latter conclusions are not identical to those obtained with Equation 29.2 (where the directional 
alternative hypothesis $H_1$: $\rho_s > 0$ is also supported at the .01 level). As noted in Section V, 
the conclusions based on use of Table A18, Equation 29.2, and Equation 29.3 will not always 
be in total agreement. 
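
The two test statistics can be reproduced with the following Python sketch, which assumes Equation 29.2 is $t = r_s\sqrt{n - 2}/\sqrt{1 - r_s^2}$ and Equation 29.3 is $z = r_s\sqrt{n - 1}$. The critical values are the tabled values quoted above; the function and variable names are assumptions made for the illustration.

```python
import math

def spearman_t(r_s, n):
    """t approximation for H0: rho_s = 0, evaluated with df = n - 2."""
    return r_s * math.sqrt(n - 2) / math.sqrt(1 - r_s ** 2)

def spearman_z(r_s, n):
    """Normal approximation for H0: rho_s = 0."""
    return r_s * math.sqrt(n - 1)

r_s, n = 0.724, 10
t_value = spearman_t(r_s, n)
z_value = spearman_z(r_s, n)

# Tabled one-tailed .01 critical values quoted in the text (df = 8 for t)
t_critical_01, z_critical_01 = 2.90, 2.33

print(round(t_value, 2), t_value > t_critical_01)  # t exceeds its critical value
print(round(z_value, 2), z_value > z_critical_01)  # z does not
```

The comparison makes the disagreement explicit: for the same $r_s$, the t approximation supports the directional alternative hypothesis at the .01 level while the normal approximation does not.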

It is noted earlier in this section that the Friedman two-way analysis of variance by ranks 
can be employed to determine whether the value of Spearman's rho is significant. This will now 
be illustrated in reference to Example 29.2. The data for Example 29.2 are rearranged in Table 
29.6 to conform to the test model for the Friedman two-way analysis of variance by ranks. 
Note that the rows and columns employed in Table 29.5 are reversed in Table 29.6. When Table 
29.6 is employed within the framework of the Friedman test model, the two critics represent 
n = 2 subjects, and the 10 ranks represent k = 10 levels of a within-subjects/repeated-measures 
independent variable. 


Table 29.6 Data for Example 29.2 Formatted for Analysis with 
the Friedman Two-Way Analysis of Variance by Ranks 

Movie                   1      2      3      4      5      6      7      8      9     10 
Critic 1                7      1      8     10      9      6      5    2.5    2.5      4 
Critic 2               10      2      6      8      7      4      9      3      1      5 
$\sum R_j$             17      3     14     18     16     10     14    5.5    3.5      9 
$(\sum R_j)^2$        289      9    196    324    256    100    196  30.25  12.25     81 


From the summary information in Table 29.6, the value $\sum_{j=1}^{k}(\sum R_j)^2$ = 1493.5 is computed. 

$$\sum_{j=1}^{k}\left(\sum R_j\right)^2 = 289 + 9 + 196 + 324 + 256 + 100 + 196 + 30.25 + 12.25 + 81 = 1493.5$$


Employing the above value, along with the other appropriate values in Equation 25.1 (the 
equation for the Friedman two-way analysis of variance by ranks), the value $\chi_r^2$ = 15.46 is 
computed.⁹ 

$$\chi_r^2 = \frac{12}{nk(k + 1)}\sum_{j=1}^{k}\left(\sum R_j\right)^2 - 3n(k + 1) = \frac{12}{(2)(10)(10 + 1)}[1493.5] - (3)(2)(10 + 1) = 15.46$$

The value $\chi_r^2$ = 15.46 is evaluated with Table A4 (Table of the Chi-Square 
Distribution) in the Appendix. For df = k - 1 = 10 - 1 = 9, the tabled critical two-tailed .05 and 
.01 values are $\chi_{.05}^2$ = 16.92 and $\chi_{.01}^2$ = 21.67, and the tabled critical one-tailed .05 and .01 
values are $\chi_{.05}^2$ = 14.68 and $\chi_{.01}^2$ = 19.50 (the latter value is interpolated).¹⁰ Employing the 
aforementioned critical values, the null hypothesis for the Friedman two-way analysis of 
variance by ranks ($H_0$: $\theta_1 = \theta_2 = \cdots = \theta_{10}$) can be rejected at the .05 level, but only if a one- 
tailed analysis is conducted (since $\chi_r^2$ = 15.46 is greater than the tabled critical one-tailed value 
$\chi_{.05}^2$ = 14.68).¹¹ The result falls short of being significant at the .05 level for a two-tailed 
analysis, since $\chi_r^2$ = 15.46 is less than the tabled critical two-tailed value $\chi_{.05}^2$ = 16.92. 
Rejection of the null hypothesis for the Friedman two-way analysis of variance by ranks is 
commensurate with rejection of the null hypothesis $H_0$: $\rho_s = 0$ for Spearman's rank-order 
correlation coefficient. In actuality, the result derived employing the Friedman two-way 
analysis of variance by ranks is similar, but not identical, to the analysis of Spearman's rho 
with Table A18, Equation 29.2, and Equation 29.3 (which, as noted earlier, are not in themselves 
in total agreement). The slight discrepancy between the results of the Friedman test and the 
more commonly employed methods for assessing the significance of Spearman's rho can be 
attributed to the fact that the test statistics based on the t, normal, and chi-square distributions are 
large sample approximations, which in the case of Example 29.2 are employed with a small 
sample size. It was also noted earlier that the values in Table A18 are approximations of the 
exact values in the underlying sampling distribution. 
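
The Friedman computation for Example 29.2 can be retraced with the Python sketch below. It assumes Equation 25.1 has the form $\chi_r^2 = [12/(nk(k + 1))]\sum(\sum R_j)^2 - 3n(k + 1)$, without the tie correction, and it takes Critic 2's rank for Movie 8 to be 3, the value implied by the column sum of 5.5. The function name is an assumption, and the critical values are the tabled values quoted in the text.

```python
def friedman_chi_square(rank_rows):
    """Friedman chi-square (no tie correction) for n rows of k ranks each."""
    n = len(rank_rows)
    k = len(rank_rows[0])
    # Sum of ranks within each of the k conditions (here, each movie)
    column_sums = [sum(row[j] for row in rank_rows) for j in range(k)]
    sum_of_squares = sum(total ** 2 for total in column_sums)
    chi_square = 12 / (n * k * (k + 1)) * sum_of_squares - 3 * n * (k + 1)
    return chi_square, sum_of_squares

# Rows are the n = 2 critics, columns are the k = 10 movies
ranks = [
    [7, 1, 8, 10, 9, 6, 5, 2.5, 2.5, 4],
    [10, 2, 6, 8, 7, 4, 9, 3, 1, 5],
]
chi_square, sum_of_squares = friedman_chi_square(ranks)

# Tabled .05 critical values for df = 9 quoted in the text
one_tailed_05, two_tailed_05 = 14.68, 16.92
print(round(chi_square, 2), chi_square > one_tailed_05, chi_square > two_tailed_05)
```

The statistic falls between the two critical values, which reproduces the conclusion that the result is significant at the .05 level only for a one-tailed analysis.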


2. Power efficiency of Spearman's rank-order correlation coefficient Daniel (1990) and 
Siegel and Castellan (1988) note that (for large sample sizes) the asymptotic relative efficiency 
(which is discussed in Section VII of the Wilcoxon signed-ranks test (Test 6)) of Spearman's 
rank-order correlation coefficient relative to the Pearson product-moment correlation 
coefficient is approximately .91 (when the assumptions underlying the latter test are met). 


3. Brief discussion of Kendall's Tau: An alternative measure of association for two sets of 
ranks Kendall’s tau (Test 30) is an alternative measure of association that can be employed 
to evaluate two sets of ranks. Although Spearman's rho and Kendall’s tau can be employed 
to measure the degree of association for the same set of data, Spearman's rho is the more com- 
monly described of the two measures (primarily because it requires fewer computations). A 
comparative discussion of Spearman's rho and Kendall's tau can be found in Section I of the 
latter test. 


4. Weighted rank/top-down correlation There may be occasions when a researcher's primary 
interest is with respect to the correlation among the most extreme scores in a set of data (i.e., that 
group of scores that comprise the highest and lowest values for both variables). The latter can 
be achieved through use of a procedure (developed by Salama and Quade (1981) and Quade and 
Salama (1992)) that weights scores such that the more extreme a score is, the greater its weight 
in determining the correlation coefficient. The latter procedure, which is referred to as weighted 
rank correlation or top-down correlation (Iman and Conover (1985, 1987)), is described in Zar 
(1999, pp. 398-401). 


VIII. Additional Examples Illustrating the Use of Spearman's 
Rank-Order Correlation Coefficient 


If a researcher elects to rank-order the scores of subjects in any of the examples for which the 
Pearson product-moment correlation coefficient is employed, a value can be computed for 
Spearman's rank-order correlation coefficient. Thus, as is the case for Example 28.1, the data 
for Examples 28.2 and 28.3 can be rank-ordered and evaluated with Spearman's rho. Since the 
rankings for the latter two examples are identical to the rankings for Example 29.1, all three 
examples yield the identical result. Since Kendall's tau and Spearman's rho can be employed 
to evaluate the same data, Example 30.1, as well as the data set presented in Table 30.4, can also 
be evaluated with Spearman's rho. 


Endnotes 


1. It should be noted that although the scores of subjects in Example 29.1 are ratio data, in 
most instances when Spearman's rank-order correlation coefficient is employed it is 
more likely that the original data for both variables are in a rank-order format. As is noted 
in Section I, conversion of ratio data to a rank-order format (which is done in Section IV 
with respect to Example 29.1) is most likely to occur when a researcher has reason to 
believe that one or more of the underlying assumptions of the Pearson product-moment 
correlation coefficient are saliently violated. Example 29.2 in Section VII represents a 
study involving two variables that are originally in a rank-order format for which 
Spearman's rho is computed. 


2. Some sources employ the following statements as the null hypothesis and the nondirectional 
alternative hypothesis for Spearman's rank-order correlation coefficient: Null hypothesis: 
$H_0$: Variables X and Y are independent of one another; Nondirectional alternative 
hypothesis: $H_1$: Variables X and Y are not independent of one another. 

It is, in fact, true that if in the underlying population the two variables are independent, 
the value of $\rho_s$ will equal zero. However, the fact that $\rho_s$ = 0, in and of itself, 
does not ensure that the variables are independent of one another. Thus, it is conceivable 
that in a population in which the correlation between X and Y is $\rho_s$ = 0, a nonmonotonic 
curvilinear function can be employed to describe the relationship between the variables. 


3. Daniel (1990) notes that the computed value of $r_s$ is not an unbiased estimate of $\rho_s$. 


4. The reader may find slight discrepancies in the critical values listed for Spearman's rho 
in the tables published in different books. The differences are due to the fact that separate 
tables derived by Olds (1938, 1949) and Zar (1972), which are not identical, are employed 
in different sources. Howell (1992, 1997) notes that the tabled critical values noted in 
various sources are approximations and not exact values. 


5. The minimum sample size for which Equation 29.3 is recommended varies depending upon 
which source one consults. Some sources recommend the use of Equation 29.3 for values 
as low as n = 25, whereas others state that n should equal at least 100. 


6. The results obtained through use of Table A18, Equation 29.2, and Equation 29.3 will not 
always be in total agreement with one another. In instances where the different methods 
for evaluating significance do not agree, there will usually not be a major discrepancy 
between them. In the final analysis, the larger the sample size the more likely it is that the 
methods will be consistent with one another. 


7. The following will always be true when Equation 28.1 is employed in computing Pearson 
r (and $r_s$), and the rank-orders are employed to represent the scores on the X and Y 
variables: $\sum X = \sum Y$ and $\sum X^2 = \sum Y^2$ (however, the latter will only be true if there are 
no ties). 


8. The relationship between Spearman's rank-order correlation coefficient and Kendall's 
coefficient of concordance is discussed in greater detail in Section VII of the latter test. 
In the latter discussion, it is noted that although when there are two sets of ranks the values 
computed for Spearman's rho and Kendall's coefficient of concordance will not be 
identical, one value can be converted into the other through use of Equation 31.7. 


9. If the tie correction for the Friedman two-way analysis of variance by ranks is 
employed, the computed value of $\chi_r^2$ will be slightly higher. 


10. The tabled critical two-tailed .05 and .01 chi-square values represent the chi-square values 
at the 95th and 99th percentiles, and the tabled critical one-tailed .05 and .01 chi-square 
values represent the chi-square values at the 90th and 98th percentiles. 


11. In the discussion of the Friedman two-way analysis of variance by ranks, it is assumed 
that a nondirectional analysis is always conducted for the latter test. A directional/one- 
tailed analysis is used here in order to employ probability values that are comparable to the 
one-tailed values employed in evaluating Spearman's rho. Within the Friedman test 
model, when k = 10, the usage of the term one-tailed analysis is really not meaningful. For 
a clarification of this issue (i.e., conducting a directional analysis when k > 3), the reader 
should read the discussion on the directionality of the chi-square goodness-of-fit test (Test 
8) in Section VII of the latter test (which can be generalized to the Friedman test). 


Test 30 
Kendall’s Tau 


(Nonparametric Measure of Association/Correlation 
Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Kendall’s tau is one of a number of measures of correlation or association discussed in this book. 
Measures of correlation are not inferential statistical tests, but are, instead, descriptive statistical 
measures that represent the degree of relationship between two or more variables. Upon com- 
puting a measure of correlation, it is common practice to employ one or more inferential statistical 
tests in order to evaluate one or more hypotheses concerning the correlation coefficient. The 
hypothesis stated below is the most commonly evaluated hypothesis for Kendall's tau. 


Hypothesis evaluated with test In the underlying population represented by a sample, is the 
correlation between subjects’ scores on two variables some value other than zero? The latter 
hypothesis can also be stated in the following form: In the underlying population represented 
by the sample, is there a significant monotonic relationship between the two variables?¹ It is 
important to note that the nature of the relationship described by Kendall’s tau is based on an 
analysis of two sets of ranks. 


Relevant background information on test Prior to reading the material in this section the 
reader should review the general discussion of correlation in Section I of the Pearson product- 
moment correlation coefficient (Test 28), and the material in Section I of Spearman's rank- 
order correlation coefficient (Test 29) (which also evaluates whether a monotonic relationship 
exists between two sets of ranks). Developed by Kendall (1938), tau is a bivariate measure of 
correlation/association that is employed with rank-order data. The population parameter 
estimated by the correlation coefficient will be represented by the notation $\tau$ (which is the lower case 
Greek letter tau). The sample statistic computed to estimate the value of $\tau$ will be represented 
by the notation $\tilde{\tau}$. As is the case with Spearman's rank-order correlation coefficient, 
Kendall's tau can be employed to evaluate data in which a researcher has scores for n 
subjects/objects on two variables (designated as the X and Y variables), both of which are rank- 
ordered. Kendall's tau is also commonly employed to evaluate the degree of agreement between 
the rankings of m = 2 judges for n subjects/objects. 

As is the case with Spearman's rho, the range of possible values Kendall's tau can assume 
is defined by the limits -1 to +1 (i.e., $-1 \leq \tilde{\tau} \leq +1$). Although Kendall's tau and Spearman's 
rho share certain properties in common with one another, they employ a different logic with 
respect to how they evaluate the degree of association between two variables. Kendall's tau 
measures the degree of agreement between two sets of ranks with respect to the relative ordering 
of all possible pairs of subjects/objects. One set of ranks represents the ranks on the X variable, 
and the other set represents the ranks on the Y variable. Specifically, assume data are in the form 
of the following two pairs of observations expressed in a rank-order format: a) ($R_{X_i}$, $R_{Y_i}$) (which, 
respectively, represent the ranks on Variables X and Y for the $i^{th}$ subject/object); and b) ($R_{X_j}$, 
$R_{Y_j}$) (which, respectively, represent the ranks on Variables X and Y for the $j^{th}$ subject/object). 
If the sign/direction of the difference ($R_{X_i} - R_{X_j}$) is the same as the sign/direction of the 
difference ($R_{Y_i} - R_{Y_j}$), a pair of ranks is said to be concordant (i.e., in agreement). If the 
sign/direction of the difference ($R_{X_i} - R_{X_j}$) is not the same as the sign/direction of the difference 
($R_{Y_i} - R_{Y_j}$), a pair of ranks is said to be discordant (i.e., disagree). If ($R_{X_i} - R_{X_j}$) and/or 
($R_{Y_i} - R_{Y_j}$) result in the value zero, a pair of ranks is neither concordant nor discordant (and is 
conceptualized within the framework of a tie, which is discussed in Section VI). Kendall's tau 
is a proportion that represents the difference between the proportion of concordant pairs of ranks 
and the proportion of discordant pairs of ranks. The computed value of tau will equal +1 when 
there is complete agreement among the rankings (i.e., all of the pairs of ranks are concordant), 
and will equal -1 when there is complete disagreement among the rankings (i.e., all of the pairs 
of ranks are discordant). 

As a result of the different logic involved in computing Kendall's tau and Spearman's 
rho, the two measures have different underlying scales and, because of this, it is not possible to 
determine the exact value of one measure if the value of the other measure is known. As a 
general rule, however, the computed absolute value of $\tilde{\tau}$ will always be less than the computed 
absolute value of $r_s$ for a set of data and, as the sample size increases, the ratio $\tilde{\tau}/r_s$ approaches 
the value .67.² Siegel and Castellan (1988) note the following inequality can be employed to 
describe the relationship between $r_s$ and $\tilde{\tau}$: $-1 \leq (3\tilde{\tau} - 2r_s) \leq 1$. 

In spite of the differences between Kendall's tau and Spearman's rho, the two statistics 
employ the same amount of information and, because of this, are equally likely to detect a 
significant effect in a population. Thus, although for the same set of data different values will 
be computed for $r_s$ and $\tilde{\tau}$ (unless, as noted in Endnote 2, the correlation between the two 
variables is +1 or -1), the two measures will essentially result in the same conclusions with 
respect to whether or not the underlying population correlation equals zero. The comparability 
of $\tilde{\tau}$ and $r_s$ is discussed in more detail in Section V. 

In contrast to Kendall's tau, Spearman's rho is more commonly discussed in statistics 
books as a bivariate measure of correlation for ranked data. Two reasons for this are as follows: 
a) The computations required for computing tau are more tedious than those required for 
computing rho; and b) When a sample is derived from a bivariate normal distribution (which is 
discussed in Section I of the Pearson product-moment correlation coefficient), the computed 
value $r_s$ will generally provide a reasonably good approximation of Pearson r, whereas the 
value of $\tilde{\tau}$ will not. Since $r_s$ provides a good estimate of r, $r_s^2$ can be employed to represent 
the coefficient of determination (i.e., a measure of the proportion of variability on one variable 
that can be accounted for by variability on the other variable).³ One commonly cited advantage 
of tau over rho is that $\tilde{\tau}$ is an unbiased estimate of the population parameter $\tau$, whereas the value 
computed for $r_s$ is not an unbiased estimate of the population parameter $\rho_s$. Lindeman et al. 
(1980) note another advantage of tau is that unlike rho, the sampling distribution of tau 
approaches normality very quickly. Because of this, the normal distribution provides a good 
approximation of the exact sampling distribution of tau for small sample sizes. In contrast, a 
large sample size is required in order to employ the normal distribution to approximate the exact 
sampling distribution of rho. 


II. Example 


Example 30.1 Two psychiatrists, Dr. X and Dr. Y, rank-order ten patients with respect to their 
level of psychological disturbance (assigning a rank of 1 to the least disturbed patient and a rank 
of 10 to the most disturbed patient). The rankings of the two psychiatrists (along with additional 
information that allows the value of Spearman's rho to be computed for the same set of data) 
are presented in Table 30.1. Is there a significant correlation between the rank-orders assigned 
to the patients by the two doctors? 


III. Null versus Alternative Hypotheses 


Upon computing Kendall's tau, it is common practice to determine whether the obtained 
absolute value of the correlation coefficient is large enough to allow a researcher to conclude that 
the underlying population correlation coefficient between the two variables is some value other 
than zero. Section V describes how the latter hypothesis, which is stated below, can be evaluated 
through use of tables of critical $\tilde{\tau}$ values or through use of an inferential statistical test that is 
based on the normal distribution. 


Null hypothesis $H_0$: $\tau = 0$ 


(In the underlying population the sample represents, the correlation between the ranks of subjects 
on Variable X and Variable Y equals 0.) 


Alternative hypothesis $H_1$: $\tau \neq 0$ 


(In the underlying population the sample represents, the correlation between the ranks of subjects 
on Variable X and Variable Y equals some value other than 0. This is a nondirectional 
alternative hypothesis, and it is evaluated with a two-tailed test. Either a significant positive 
$\tilde{\tau}$ value or a significant negative $\tilde{\tau}$ value will provide support for this alternative hypothesis. 
In order to be significant, the obtained absolute value of $\tilde{\tau}$ must be equal to or greater than 
the tabled critical two-tailed $\tilde{\tau}$ value at the prespecified level of significance.) 


or 
$H_1$: $\tau > 0$ 


(In the underlying population the sample represents, the correlation between the ranks of subjects 
on Variable X and Variable Y equals some value greater than 0. This is a directional alternative 
hypothesis, and it is evaluated with a one-tailed test. Only a significant positive $\tilde{\tau}$ value will 
provide support for this alternative hypothesis. In order to be significant (in addition to the 
requirement of a positive $\tilde{\tau}$ value), the obtained absolute value of $\tilde{\tau}$ must be equal to or greater 
than the tabled critical one-tailed $\tilde{\tau}$ value at the prespecified level of significance.) 


or 
$H_1$: $\tau < 0$ 


(In the underlying population the sample represents, the correlation between the ranks of subjects 
on Variable X and Variable Y equals some value less than 0. This is a directional alternative 
hypothesis, and it is evaluated with a one-tailed test. Only a significant negative $\tilde{\tau}$ value will 
provide support for this alternative hypothesis. In order to be significant (in addition to the 
requirement of a negative $\tilde{\tau}$ value), the obtained absolute value of $\tilde{\tau}$ must be equal to or greater 
than the tabled critical one-tailed $\tilde{\tau}$ value at the prespecified level of significance.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected.⁴ 


IV. Test Computations 


The data for Example 30.1 are summarized in Table 30.1. Although the last two columns of 
Table 30.1 are not necessary to compute the value of Kendall's tau, they are included to allow 
for the computation of Spearman's rank-order correlation coefficient for the same set of data 
(which is done in Section V). 


Table 30.1 Data for Example 30.1 

            Rankings    Rankings 
Patient     of Dr. X    of Dr. Y 
            $R_X$       $R_Y$       $d = R_X - R_Y$     $d^2$ 
   1          7          10              -3               9 
   2          1           2              -1               1 
   3          8           6               2               4 
   4         10           8               2               4 
   5          9           7               2               4 
   6          6           4               2               4 
   7          5           9              -4              16 
   8          3           3               0               0 
   9          2           1               1               1 
  10          4           5              -1               1 
                                    $\sum d = 0$     $\sum d^2 = 44$ 



Equation 30.1 is employed to compute the value of Kendall's tau. 

$$\tilde{\tau} = \frac{n_C - n_D}{\left[\frac{n(n - 1)}{2}\right]} \qquad \text{(Equation 30.1)}$$

Where: $n_C$ is the number of concordant pairs of ranks 
       $n_D$ is the number of discordant pairs of ranks 
       [n(n - 1)]/2 is the total number of possible pairs of ranks 

In order to employ Equation 30.1 to compute the value of Kendall's tau, it is necessary to 
determine the number of concordant versus discordant pairs of ranks. In order to do this, the data 
are recorded in the format employed in Table 30.2. 

The first row of Table 30.2 consists of the identification number of each subject. The order 
in which subjects are listed is based on their rank-order on the X variable (i.e., $R_X$). The latter 
set of ranks is recorded in the second row of the table. The third row lists each subject's rank- 
order on the Y variable (i.e., $R_Y$). Inspection of Table 30.2 reveals that no ties are present in the 
data on either the X or the Y variable. The protocol to be described in this section assumes that 
there are no ties. The protocol for handling ties is described in Section VI. The portion of Table 
30.2 that lies below the double line consists of cells in which there is either an entry of C or D 
(except for the number value to the left of each row). This part of the table provides information 
with regard to the concordant versus discordant pairs of observations for the two sets of ranks. 
Specifically, in each of the rows of the table that fall below the double line, the number to the left 
of a row is the $R_Y$ value (i.e., rank on the Y variable) of the subject represented by the column 
in which that value appears. Within each row, the $R_Y$ value is compared with those $R_Y$ values 
that fall in the columns to its right. In any instance where an $R_Y$ value in a column is larger than 
the $R_Y$ value for the row, a C is recorded in the cell that is the intersection of that row and 
column. In any instance where an $R_Y$ value in a column is smaller than the $R_Y$ value for the 
row, a D is recorded in the cell that is the intersection of that row and column. The presence of 
a C in a cell indicates a concordant pair of observations, since the ordering of the ranks on both 
the X and Y variables for that pair of observations is in the same direction. The presence of a D 
in a cell indicates a discordant pair of observations, since the ordering of the ranks on both the 
X and Y variables for that pair of observations is in the opposite direction. 


Table 30.2 Computational Table for Kendall's Tau 

Subject     2   9   8  10   7   6   1   3   5   4    ΣC    ΣD 
$R_X$       1   2   3   4   5   6   7   8   9  10 
$R_Y$       2   1   3   5   9   4  10   6   7   8 
============================================================= 
  2             D   C   C   C   C   C   C   C   C     8     1 
  1                 C   C   C   C   C   C   C   C     8     0 
  3                     C   C   C   C   C   C   C     7     0 
  5                         C   D   C   C   C   C     5     1 
  9                             D   C   D   D   D     1     4 
  4                                 C   C   C   C     4     0 
 10                                     D   D   D     0     3 
  6                                         C   C     2     0 
  7                                             C     1     0 
  8                                                   0     0 

                         ΣΣC = $n_C$ = 36     ΣΣD = $n_D$ = 9 
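
The protocol used to build Table 30.2 can be expressed as a short Python sketch. It assumes, as the protocol itself does, that there are no ties, and that the $R_Y$ values have already been arranged in order of increasing $R_X$. The function name and the tuple layout are choices made for this illustration rather than anything prescribed by the text.

```python
def computational_table(ry_ordered_by_rx):
    """Build the C/D rows of the computational table for untied ranks.

    Each returned row is (R_Y value, list of 'C'/'D' entries, count of C, count of D),
    where the entries compare that R_Y value with every R_Y value to its right.
    """
    rows = []
    for position, reference in enumerate(ry_ordered_by_rx):
        entries = ['C' if other > reference else 'D'
                   for other in ry_ordered_by_rx[position + 1:]]
        rows.append((reference, entries, entries.count('C'), entries.count('D')))
    return rows

# Third row of Table 30.2: R_Y values for subjects ordered by R_X = 1, ..., 10
r_y = [2, 1, 3, 5, 9, 4, 10, 6, 7, 8]
table = computational_table(r_y)

n_c = sum(row[2] for row in table)   # total number of concordant pairs
n_d = sum(row[3] for row in table)   # total number of discordant pairs
n = len(r_y)
tau = (n_c - n_d) / (n * (n - 1) / 2)   # Equation 30.1

for reference, entries, count_c, count_d in table:
    print(reference, ' '.join(entries), count_c, count_d)
print(n_c, n_d, tau)
```

Because every pair of subjects is visited exactly once, the two totals always sum to n(n - 1)/2 when there are no ties.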


The last two columns of Table 30.2 contain the number of concordant (ΣC) versus 
discordant (ΣD) pairs of observations in each row. The value ΣΣC = $n_C$ = 36 in the last row 
of Table 30.2 is the sum of the column labelled ΣC. The value ΣΣC = $n_C$ = 36 represents the 
total number of C entries in the table, which is the total number of concordant pairs of 
observations in the data. The value ΣΣD = $n_D$ = 9 is the sum of the column labelled ΣD. The value 
ΣΣD = $n_D$ = 9 represents the total number of D entries in the table, which is the total number 
of discordant pairs of observations in the data (which are also referred to as inversions). 

Substituting the values $n_C$ = 36, $n_D$ = 9, and n = 10 in Equation 30.1, the value $\tilde{\tau}$ = .60 
is computed. 

$$\tilde{\tau} = \frac{36 - 9}{\left[\frac{10(10 - 1)}{2}\right]} = \frac{27}{45} = .60$$



The reader should note the following with respect to the sign of $\tilde{\tau}$: a) When $n_C > n_D$, the 
sign of $\tilde{\tau}$ will be positive; b) When $n_D > n_C$, the sign of $\tilde{\tau}$ will be negative; and c) When 
$n_C = n_D$, $\tilde{\tau}$ will equal zero. 

Some sources employ the notation S to represent the value (no - nj) in the numerator 
of Equation 30.1. For Example 30.1, $ = 36 - 9 = 27. If the value S is employed, the value of 
€ can be computed with Equation 30.2. 


\[ \tilde{\tau} = \frac{2S}{n(n-1)} = \frac{(2)(27)}{(10)(10-1)} = .60 \]    (Equation 30.2)





Equation 30.3 can also be employed to compute τ̃.

\[ \tilde{\tau} = 1 - \frac{4n_D}{n(n-1)} = 1 - \frac{(4)(9)}{10(10-1)} = .60 \]    (Equation 30.3)
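
The counting protocol of Table 30.2 and the arithmetic of Equations 30.1 through 30.3 can be sketched in Python as follows. This is a minimal illustration that assumes no tied ranks are present; the function name kendall_counts and the variable names are hypothetical and are not taken from the Handbook.

```python
from itertools import combinations

def kendall_counts(rx, ry):
    """Count concordant and discordant pairs of ranks (assumes no ties)."""
    n_c = n_d = 0
    for i, j in combinations(range(len(rx)), 2):
        # A pair is concordant when both variables order it the same way.
        product = (rx[i] - rx[j]) * (ry[i] - ry[j])
        if product > 0:
            n_c += 1
        elif product < 0:
            n_d += 1
    return n_c, n_d

# Ranks from the second and third rows of Table 30.2.
r_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
r_y = [2, 1, 3, 5, 9, 4, 10, 6, 7, 8]
n = len(r_x)

n_c, n_d = kendall_counts(r_x, r_y)
s = n_c - n_d

tau_30_1 = (n_c - n_d) / (n * (n - 1) / 2)   # Equation 30.1
tau_30_2 = 2 * s / (n * (n - 1))             # Equation 30.2
tau_30_3 = 1 - 4 * n_d / (n * (n - 1))       # Equation 30.3

print(n_c, n_d, s)
print(round(tau_30_1, 2), round(tau_30_2, 2), round(tau_30_3, 2))
```

All three equations are algebraically equivalent when there are no ties, so the three computed values should agree.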

If there are no tied ranks present in the data, Equation 30.3 can be employed to compute tau
in conjunction with a less tedious method than the one based on use of Table 30.2. The
alternative method (which becomes impractical when the sample size is large) involves the use
of Figure 30.1.


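Patient     2    9    8   10    7    6    1    3    5    4
R_X         1    2    3    4    5    6    7    8    9   10
R_Y         2    1    3    5    9    4   10    6    7    8

[Lines connecting the corresponding values of R_X and R_Y are not reproduced.]

Figure 30.1 Visual Representation of Discordant Pairs of Ranks for Example 30.1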


In Figure 30.1 the values of R_X and R_Y are recorded as they appear in the second and
third rows of Table 30.2. Lines are drawn to connect each of the n = 10 corresponding values
of R_X and R_Y. The total number of intersecting points in the diagram represents the number of
discordant pairs of ranks or inversions in the data (i.e., the value of n_D). The value n_D = 9
along with the value n = 10 are substituted in Equation 30.3 to compute the value τ̃ = .60.
Although not required for use in the latter equation, the number of concordant pairs of ranks
(which, along with n_D, is employed in Equation 30.1) can be computed as follows:
n_C = [n(n - 1)/2] - n_D = [(10)(10 - 1)/2] - 9 = 36.


V. Interpretation of the Test Results 

The obtained value τ̃ = .60 is evaluated with Table A19 (Table of Critical Values for
Kendall’s Tau) in the Appendix. Note that Table A19 lists critical values for both tau and S.⁵
Table 30.3 lists the tabled critical two-tailed and one-tailed .05 and .01 values for tau and S for
n = 10.


Table 30.3 Exact Tabled Critical Values for τ̃ and S for n = 10

                       τ̃_.05    S_.05    τ̃_.01    S_.01
Two-tailed values       .511      23      .644      29
One-tailed values       .467      21      .600      27


The following guidelines are employed in evaluating the null hypothesis.

a) If the nondirectional alternative hypothesis H_1: τ ≠ 0 is employed, the null hypothesis
can be rejected if the obtained absolute value of τ̃ (or S) is equal to or greater than the tabled
critical two-tailed value at the prespecified level of significance.

b) If the directional alternative hypothesis H_1: τ > 0 is employed, the null hypothesis can
be rejected if the sign of τ̃ (or S) is positive, and the obtained value of τ̃ (or S) is equal to or
greater than the tabled critical one-tailed value at the prespecified level of significance.

c) If the directional alternative hypothesis H_1: τ < 0 is employed, the null hypothesis can
be rejected if the sign of τ̃ (or S) is negative, and the obtained absolute value of τ̃ (or S) is equal
to or greater than the tabled critical one-tailed value at the prespecified level of significance.

Employing the above guidelines, the nondirectional alternative hypothesis H_1: τ ≠ 0 is
supported at the .05 level, since the computed value τ̃ = .60 (S = 27) is greater than the tabled
critical two-tailed value τ̃_.05 = .511 (S_.05 = 23). It is not supported at the .01 level, since
τ̃ = .60 (S = 27) is less than the tabled critical two-tailed value τ̃_.01 = .644 (S_.01 = 29).

The directional alternative hypothesis H_1: τ > 0 is supported at the .05 level, since the
computed value τ̃ = .60 is a positive number that is greater than the tabled critical one-tailed
value τ̃_.05 = .467 (S_.05 = 21). It is also supported at the .01 level, since τ̃ = .60 is equal to the
tabled critical one-tailed value τ̃_.01 = .600 (S_.01 = 27).

The directional alternative hypothesis H_1: τ < 0 is not supported, since the computed
value τ̃ = .60 (S = 27) is a positive number.
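
The exact one-tailed probability behind a tabled critical value of S can be sketched in Python. The sketch rests on two assumptions that are not stated in this section: that no ties are present, and that under the null hypothesis all n! orderings of the Y ranks are equally likely, so that the distribution of n_D is the distribution of the number of inversions in a random permutation. The function names are hypothetical.

```python
from math import factorial

def inversion_counts(n):
    """counts[k] = number of orderings of n ranks with exactly k inversions."""
    counts = [1]  # a single rank can only be ordered one way (0 inversions)
    for size in range(2, n + 1):
        new = [0] * (len(counts) + size - 1)
        for k, c in enumerate(counts):
            # Placing the new largest rank adds 0, 1, ..., size - 1 inversions.
            for extra in range(size):
                new[k + extra] += c
        counts = new
    return counts

def upper_tail_probability(n, s):
    """P(S >= s) under the null hypothesis, using S = n(n - 1)/2 - 2 n_D."""
    total_pairs = n * (n - 1) // 2
    max_discordant = (total_pairs - s) // 2
    counts = inversion_counts(n)
    return sum(counts[: max_discordant + 1]) / factorial(n)

p_one_tailed = upper_tail_probability(10, 27)   # Example 30.1: n = 10, S = 27
p_two_tailed = 2 * p_one_tailed
print(round(p_one_tailed, 4), round(p_two_tailed, 4))
```

Under these assumptions the one-tailed probability for S = 27 falls below .01 while the two-tailed probability does not, which is consistent with the conclusions drawn above from Table 30.3.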


Test 30a: Test of significance for Kendall's tau When n > 10, the normal distribution
provides an excellent approximation of the sampling distribution of tau. Equation 30.4 is the
normal approximation for evaluating the null hypothesis H_0: τ = 0.


\[ z = \frac{3\tilde{\tau}\sqrt{n(n-1)}}{\sqrt{2(2n+5)}} \]    (Equation 30.4)


Although the sample size n = 10 employed in Example 30.1 is one subject below the
minimum value generally recommended for use with Equation 30.4, the normal
approximation will still provide a reasonably good approximation of the exact sampling
distribution. When the appropriate values from Example 30.1 are substituted in Equation 30.4,
the value z = 2.41 is computed.


\[ z = \frac{(3)(.60)\sqrt{10(10-1)}}{\sqrt{2[(2)(10)+5]}} = 2.41 \]
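
The normal approximation can be checked numerically with a short Python sketch. It evaluates Equation 30.4 together with the two alternative forms (Equations 30.5 and 30.6), and also applies the correction for continuity described in Endnote 6 (reduce the absolute value of S by 1). The variable names are illustrative only.

```python
from math import sqrt

n, tau, s = 10, 0.60, 27   # values from Example 30.1

# Equation 30.4
z_30_4 = 3 * tau * sqrt(n * (n - 1)) / sqrt(2 * (2 * n + 5))

# Equation 30.5: tau divided by the standard deviation of its sampling distribution
z_30_5 = tau / sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))

# Equation 30.6: the same test expressed in terms of S
sd_of_s = sqrt(n * (n - 1) * (2 * n + 5) / 18)
z_30_6 = s / sd_of_s

# Correction for continuity (Endnote 6): move S one unit toward zero.
s_corrected = s - 1 if s > 0 else s + 1
z_corrected = s_corrected / sd_of_s

print(round(z_30_4, 2), round(z_30_5, 2), round(z_30_6, 2), round(z_corrected, 2))
```

The corrected value is smaller in absolute value than the uncorrected one, which is why the correction yields a more conservative test.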


Equations 30.5 and 30.6 are alternative equations for computing the value of z that yield 
the identical result. 




















\[ z = \frac{\tilde{\tau}}{\sqrt{\frac{2(2n+5)}{9n(n-1)}}} = \frac{.60}{\sqrt{\frac{2[(2)(10)+5]}{(9)(10)(10-1)}}} = 2.41 \]    (Equation 30.5)

\[ z = \frac{S}{\sqrt{\frac{n(n-1)(2n+5)}{18}}} = \frac{27}{\sqrt{\frac{(10)(10-1)[(2)(10)+5]}{18}}} = 2.41 \]    (Equation 30.6)

The computed value z = 2.41 is evaluated with Table A1 (Table of the Normal
Distribution) in the Appendix. In the latter table, the tabled critical two-tailed .05 and .01
values are z_.05 = 1.96 and z_.01 = 2.58, and the tabled critical one-tailed .05 and .01 values are
z_.05 = 1.65 and z_.01 = 2.33. Since the sign of the z value computed with Equations 30.4–30.6
will always be the same as the sign of τ̃ (and S), the guidelines that are described earlier in this
section for evaluating a τ̃ (or S) value can also be applied in evaluating the z value computed with
Equations 30.4–30.6 (i.e., substitute z in place of τ̃ (or S) in the text of the guidelines for
evaluating τ̃ (or S)).

Employing the guidelines, the nondirectional alternative hypothesis H_1: τ ≠ 0 is supported
at the .05 level, since the computed value z = 2.41 is greater than the tabled critical two-tailed
value z_.05 = 1.96. It is not, however, supported at the .01 level, since z = 2.41 is less than
the tabled critical two-tailed value z_.01 = 2.58.

The directional alternative hypothesis H_1: τ > 0 is supported at both the .05 and .01
levels, since the computed value z = 2.41 is a positive number that is greater than the tabled
critical one-tailed values z_.05 = 1.65 and z_.01 = 2.33.

The directional alternative hypothesis H_1: τ < 0 is not supported, since the computed
value z = 2.41 is a positive number. In order for the alternative hypothesis H_1: τ < 0 to be
supported, the computed value of z must be a negative number, and the absolute
value of z must be equal to or greater than the tabled critical one-tailed value at the prespecified
level of significance.

Note that the results for the normal approximation are identical to those obtained when the
exact values of the sampling distribution of tau are employed. A summary of the analysis of
Example 30.1 follows: It can be concluded that there is a significant monotonic increasing/
positive relationship between the rankings of the two judges.⁷ The result of the analysis (based
on the critical values in Table 30.3 and the normal approximation) can be summarized as follows
(if it is assumed the nondirectional alternative hypothesis H_1: τ ≠ 0 is employed): τ̃ = .60,
p < .05.

It is noted in Section I that if both Kendall’s tau and Spearman’s rho are computed for
the same set of data, the two measures will result in essentially the same conclusions with respect
to whether the value of the underlying population correlation equals zero. In order to
demonstrate this, employing the relevant values from Table 30.1 in Equation 29.1, the value
r_S = .733 is computed for Example 30.1.


\[ r_S = 1 - \frac{6\Sigma d^2}{n(n^2-1)} = 1 - \frac{(6)(44)}{10[(10)^2-1]} = .733 \]


Note that the values τ̃ = .60 and r_S = .733 computed for Example 30.1 are not identical
to one another, and that as noted in Section I, the absolute value of τ̃ is less than the
absolute value of r_S. It is also the case that the inequality -1 ≤ (3τ̃ - 2r_S) ≤ 1 (which is noted
in Section I) is substantiated, since (3)(.60) - (2)(.733) = .334 (which falls within the range -1
to +1).
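
The comparison of the two coefficients can be reproduced with a short Python sketch. It recomputes the sum of squared rank differences from the ranks listed in Table 30.2 (the same ranks summarized in Table 30.1) and then checks the inequality cited from Section I. The variable names are illustrative only.

```python
# Ranks assigned to the ten patients (second and third rows of Table 30.2).
r_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
r_y = [2, 1, 3, 5, 9, 4, 10, 6, 7, 8]
n = len(r_x)

# Equation 29.1 (valid when no ties are present).
sum_d_squared = sum((x - y) ** 2 for x, y in zip(r_x, r_y))
r_s = 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

tau = 0.60                      # value computed with Equation 30.1
bound_check = 3 * tau - 2 * r_s

print(sum_d_squared, round(r_s, 3), round(bound_check, 3))
```

Because the sketch carries r_S at full precision rather than the rounded value .733, the last quantity printed differs from the text's .334 in the third decimal place.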

The computed value r_S = .733 is evaluated with Table A18 (Table of Critical Values for
Spearman's Rho) in the Appendix. Employing the latter table, it is determined that for n = 10,
the tabled critical two-tailed .05 and .01 values are r_S.05 = .648 and r_S.01 = .794, and the tabled
critical one-tailed .05 and .01 values are r_S.05 = .564 and r_S.01 = .745. Employing the
aforementioned critical values, the nondirectional alternative hypothesis H_1: ρ_S ≠ 0 and the
directional alternative hypothesis H_1: ρ_S > 0 are supported at the .05 level, since the
computed value r_S = .733 is greater than the tabled critical two-tailed value r_S.05 = .648 and
the tabled critical one-tailed value r_S.05 = .564. The alternative hypotheses are not supported
at the .01 level, since r_S = .733 is less than the tabled critical two-tailed value r_S.01 = .794
and the tabled critical one-tailed value r_S.01 = .745. This result is almost identical to that
obtained when Kendall’s tau is employed (although in the analysis for Kendall’s tau, the
directional alternative hypothesis H_1: τ > 0 is supported at the .01 level).

If the computed value r_S = .733 is evaluated with Equation 29.2, the value t = 3.05 is
computed.


\[ t = \frac{r_S\sqrt{n-2}}{\sqrt{1-r_S^2}} = \frac{.733\sqrt{10-2}}{\sqrt{1-(.733)^2}} = 3.05 \]


The t value computed with Equation 29.2 is evaluated with Table A2 (Table of Student’s
t Distribution) in the Appendix. The degrees of freedom employed are df = n - 2. Employing
Table A2, it is determined that for df = 10 - 2 = 8, the tabled critical two-tailed .05 and .01
values are t_.05 = 2.31 and t_.01 = 3.36, and the tabled critical one-tailed .05 and .01 values are
t_.05 = 1.86 and t_.01 = 2.90. Employing the aforementioned critical values, the
nondirectional alternative hypothesis H_1: ρ_S ≠ 0 is supported at the .05 level, since the computed value
t = 3.05 is greater than the tabled critical two-tailed value t_.05 = 2.31. It is not supported at the
.01 level, since t = 3.05 is less than t_.01 = 3.36. The directional alternative hypothesis
H_1: ρ_S > 0 is supported at both the .05 and .01 levels, since the computed value t = 3.05 is
a positive number (since r_S = .733 is a positive number) that is greater than the tabled critical
one-tailed values t_.05 = 1.86 and t_.01 = 2.90. This result is identical to that obtained when
Kendall’s tau is employed.
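
The evaluation of r_S with Equation 29.2 can be sketched in Python as follows. The critical values are copied from the text above (Table A2, df = 8) rather than computed, and the dictionary and variable names are hypothetical.

```python
from math import sqrt

n, r_s = 10, 0.733
df = n - 2

# Equation 29.2
t = r_s * sqrt(n - 2) / sqrt(1 - r_s ** 2)

# Tabled critical t values for df = 8, as quoted in the text.
critical = {
    ("two-tailed", 0.05): 2.31,
    ("two-tailed", 0.01): 3.36,
    ("one-tailed", 0.05): 1.86,
    ("one-tailed", 0.01): 2.90,
}

# A hypothesis is supported when the obtained t reaches the critical value.
supported = {key: t >= value for key, value in critical.items()}

print(df, round(t, 2))
for key, flag in supported.items():
    print(key, flag)
```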

The slight discrepancies between the various methods for assessing the significance of τ̃
and r_S can be attributed to the fact that the values in Table A18 and the result of Equation 29.2
are approximations of the exact sampling distribution of Spearman’s rho, as well as the fact that
the use of the normal distribution for assessing the significance of tau also represents an
approximation of an exact sampling distribution. However, for the most part, regardless of whether one
elects to compute τ̃ or r_S as the measure of association for Example 30.1, it will be
concluded that the population correlation is some value other than zero, and the latter conclusion will
be reached irrespective of whether a nondirectional or directional alternative hypothesis is
employed.


VI. Additional Analytical Procedures for Kendall’s Tau and/or 
Related Tests 


1. Tie correction for Kendall’s tau When one or more ties are present in a set of data, it is
necessary to employ a tie correction in order to compute the value of τ̃. To illustrate how ties
are handled, let us assume that Table 30.4 summarizes the data for Example 30.1. Note that in
contrast to Table 30.2, the data in Table 30.4 are characterized by the presence of ties on both
the X and Y variables. Specifically, Subjects 2 and 9 are tied for the first ordinal position on the
X variable, Subjects 8 and 10 are tied for the third ordinal position on the Y variable, and Subjects
6 and 7 are tied for the ninth ordinal position on the Y variable.

As is the case for Table 30.2, the entries C and D are employed in the cells of Table 30.4
to indicate concordant versus discordant pairs of ranks. There are, however, three cells in Table
30.4 that involve tied ranks in which the cell entry is 0. Note that if the R_Y value for a row is
equal to an R_Y value that falls in a column to its right, a 0 is written in the cell that is the
intersection of that row and column. Since, however, this protocol only takes into account ties on
the Y variable, it will not allow one to identify all of the cells in the table for which 0 is the
appropriate entry. In point of fact, a 0 entry should also appear in any cell that involves a pair of
tied observations on the X variable, even though the two ranks on the Y variable with which the X
variable pair is being contrasted are not tied. In the example under discussion there is just one set
of ties on the X variable (the rank of 1.5 for Subjects 2 and 9). Note that the cell identified with an
asterisk in the upper left of the table has a 0 entry, even though the rank-order directly to the right
of the value R_Y = 2 (which is the rank of Subject 2 on the Y variable) is R_Y = 1 (which is the rank
of Subject 9 on the Y variable). If the protocol described for Table 30.2 is employed, since the
rank-order R_Y = 1 to the right of R_Y = 2 is less than the latter value, a D should be placed in that
cell. The reason for employing a 0 in the cell is that if the arrangement of the ranks on the X
variable is reversed, with Subject 9 listed first and Subject 2 listed second, the value of R_Y
for that row will be R_Y = 1, and the first rank/R_Y value that it will be compared with will be
R_Y = 2. If the latter arrangement is employed in conjunction with the protocol described for
Table 30.2, the appropriate entry for the cell under discussion is a C. Thus, whenever a different
arrangement of the tied ranks on the X variable will result in a different letter entry for a cell (i.e.,
C versus D), that cell is assigned a 0.


Table 30.4 Computational Table for Kendall’s Tau Involving Ties 


Subject     2    9    8   10    7    6    1    3    5    4     ΣC   ΣD
R_X       1.5  1.5    3    4    5    6    7    8    9   10
R_Y         2    1  3.5  3.5  9.5  9.5    5    6    7    8

2               0*    C    C    C    C    C    C    C    C      8    0
1                     C    C    C    C    C    C    C    C      8    0
3.5                        0    C    C    C    C    C    C      6    0
3.5                             C    C    C    C    C    C      6    0
9.5                                  0    D    D    D    D      0    4
9.5                                       D    D    D    D      0    4
5                                              C    C    C      3    0
6                                                   C    C      2    0
7                                                        C      1    0
8                                                               0    0

                                   ΣΣC = n_C = 34    ΣΣD = n_D = 8


A general protocol for determining whether a cell should be assigned a 0 to represent a tie
can be summarized as follows: a) If the R_Y value at the left of a row is tied with an R_Y value
that falls in a column to its right, a 0 should be placed in the cell that is the intersection of that
row and column; and b) If there is a tie between the values of R_Xi and R_Xj that fall directly above
the values of R_Yi and R_Yj being compared, a 0 should be placed in the cell that is the intersection
of the row and column in which the values R_Yi and R_Yj (as well as R_Xj) appear.

It should be noted that when there are no ties present in the data, n_C + n_D = [n(n - 1)]/2.
Thus, in the case of Table 30.2, (n_C = 36) + (n_D = 9) = [(10)(10 - 1)]/2 = 45. When, on
the other hand, ties are present in the data, since entries of 0 are not counted as either concordant
or discordant pairs, n_C + n_D ≠ [n(n - 1)]/2. The latter can be confirmed by the fact that in
Table 30.4, (n_C = 34) + (n_D = 8) = 42 ≠ [(10)(10 - 1)]/2 = 45.

The computation of the value of Kendall’s tau using the tie correction will now be
described. In the example under discussion there is s = 1 set of ties involving the ranks of
subjects’ X scores (Subjects 2 and 9), and s = 2 sets of ties involving the ranks of subjects’ Y
scores (Subjects 8 and 10; Subjects 6 and 7). Equation 30.9 is employed to compute the
tie-corrected value of Kendall’s tau, which will be represented by the notation τ̃_c. Note that
the values ΣT_X and ΣT_Y in Equation 30.9 are computed with Equations 30.7 and 30.8.
In Equation 30.7, t_i represents the number of X scores that are tied for a given rank. In
Equation 30.8, t_i represents the number of Y scores that are tied for a given rank. The summations
Σ(t_i² - t_i) in Equations 30.7 and 30.8, which are taken over the s sets of ties, indicate that the
following is done with respect to each of the variables: a) For each set of ties, the number of ties
in the set is subtracted from the square of the number of ties in that set; and b) The sum of all the
values computed in part a) is obtained for that variable.

When the data from Table 30.4 are substituted in Equations 30.7–30.9, the tie-corrected
value τ̃_c = .598 is computed.⁹


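\[ \Sigma T_X = \sum_{i=1}^{s} (t_i^2 - t_i) = [(2)^2 - 2] = 2 \]    (Equation 30.7)

\[ \Sigma T_Y = \sum_{i=1}^{s} (t_i^2 - t_i) = [(2)^2 - 2] + [(2)^2 - 2] = 4 \]    (Equation 30.8)

\[ \tilde{\tau}_c = \frac{2(n_C - n_D)}{\sqrt{n(n-1) - \Sigma T_X}\,\sqrt{n(n-1) - \Sigma T_Y}} = \frac{(2)(34 - 8)}{\sqrt{10(10-1) - 2}\,\sqrt{10(10-1) - 4}} = .598 \]    (Equation 30.9)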
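
The tie-corrected computation can be sketched in Python. In this sketch any pair that is tied on either variable is treated as a 0 entry (neither concordant nor discordant), which matches the protocol described for Table 30.4. The function names tie_sum and counts_with_ties are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def tie_sum(ranks):
    """Sum of (t_i^2 - t_i) over the sets of tied ranks (Equations 30.7/30.8)."""
    return sum(t * t - t for t in Counter(ranks).values())

def counts_with_ties(rx, ry):
    """Concordant and discordant counts; pairs tied on X or Y count as 0 entries."""
    n_c = n_d = 0
    for i, j in combinations(range(len(rx)), 2):
        product = (rx[i] - rx[j]) * (ry[i] - ry[j])
        if product > 0:
            n_c += 1
        elif product < 0:
            n_d += 1
    return n_c, n_d

# Ranks from the second and third rows of Table 30.4.
r_x = [1.5, 1.5, 3, 4, 5, 6, 7, 8, 9, 10]
r_y = [2, 1, 3.5, 3.5, 9.5, 9.5, 5, 6, 7, 8]
n = len(r_x)

n_c, n_d = counts_with_ties(r_x, r_y)
t_x, t_y = tie_sum(r_x), tie_sum(r_y)

# Equation 30.9 (tie-corrected) and Equation 30.1 (uncorrected, see Endnote 9).
tau_c = 2 * (n_c - n_d) / (sqrt(n * (n - 1) - t_x) * sqrt(n * (n - 1) - t_y))
tau_uncorrected = (n_c - n_d) / (n * (n - 1) / 2)

print(n_c, n_d, t_x, t_y, round(tau_c, 3), round(tau_uncorrected, 3))
```

When no ties are present both tie sums equal zero, and Equation 30.9 reduces to Equation 30.2.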


2. Regression analysis and Kendall's tau As noted in the discussion of Spearman’s rank- 
order correlation coefficient, regression analysis procedures have been developed for rank- 
order data. Sources for nonparametric regression analysis (including monotonic regression 
analysis) are cited in Section VI of the latter test. 


3. Partial rank correlation The computation of a partial correlation coefficient (described
in Section IX of the Pearson product-moment correlation coefficient, and discussed briefly
in Section VI of Spearman's rank-order correlation coefficient) can be extended to Kendall’s
tau. Thus, when the rank-orders for three variables are evaluated, Equation 28.72 can be
employed to compute one or more partial correlation coefficients for Kendall's tau (employing
the relevant values of τ̃ in the equation). Conover (1980, 1999), Daniel (1990), Marascuilo and
McSweeney (1977), and Siegel and Castellan (1988) discuss the computation of a partial
correlation coefficient in reference to Kendall’s tau. It should be noted that the partial rank-
order correlation coefficient for Kendall’s tau employs a different sampling distribution than the
one that is employed for evaluating τ̃. Tables for the appropriate sampling distribution can be
found in Daniel (1990) and Siegel and Castellan (1988).


4. Sources for computing a confidence interval for Kendall’s tau A procedure (attributed
to Noether (1967)) for deriving a confidence interval for Kendall's tau is described in Daniel
(1990).


VII. Additional Discussion of Kendall’s Tau


1. Power efficiency of Kendall’s tau Daniel (1990) and Siegel and Castellan (1988) note that 
(for large sample sizes) the asymptotic relative efficiency (which is discussed in Section VII 
of the Wilcoxon signed-ranks test (Test 6)) of Kendall’s tau relative to the Pearson product- 
moment correlation coefficient is approximately .91 (when the assumptions underlying the 
latter test are met). 


2. Kendall’s coefficient of agreement Kendall’s coefficient of agreement is another measure 
of association that allows a researcher to evaluate the degree of agreement between m sets of 
ranks on n subjects/objects. The latter measure, which is described in Siegel and Castellan 
(1988), is essentially an extension of Kendall’s tau to more than two sets of ranks. The relation- 
ship between Kendall’s tau and Kendall’s coefficient of agreement is analogous to the 
relationship between Spearman's rho and Kendall’s coefficient of concordance (Test 31). 


VIII. Additional Examples Illustrating the Use of Kendall's Tau 


Since Spearman's rho and Kendall’s tau can be employed to evaluate the same data, Examples
29.1 and 29.2, as well as the data set presented in Tables 29.2/29.4, can be evaluated with
Kendall’s tau. It is also the case that if a researcher elects to rank-order the scores of subjects
in any of the examples for which the Pearson product-moment correlation coefficient is
employed, a value can be computed for Kendall’s tau. To illustrate this, Example 28.1 (which
is identical to Example 29.1) will be evaluated with Kendall’s tau. The rank-orders of the scores
of subjects on the X and Y variables in Examples 28.1/29.1 are arranged in Figure 30.2. The
arrangement of the ranks in Figure 30.2 allows for use of the protocol for determining the
number of discordant pairs of ranks that is described in reference to Figure 30.1. Since none of
the vertical lines intersect, the number of pairs of discordant ranks is n_D = 0. Since each subject
has the identical rank on both the X and the Y variables, all of the pairs of ranks are concordant.
The total number of pairs of ranks is [(5)(5 - 1)]/2 = 10, which is also the value of n_C.


Subject     2    3    5    4    1
R_X         1    2    3    4    5
            |    |    |    |    |
R_Y         1    2    3    4    5


Figure 30.2 Visual Representation of Discordant Pairs of Ranks
for Examples 28.1/29.1


Employing the values n = 5 and n_D = 0 in Equation 30.3, the value τ̃ = 1 is computed.
The same value can also be computed with either Equation 30.1 or Equation 30.2, if the values
n_C = 10 and/or S = 10 are employed in the aforementioned equations.


\[ \tilde{\tau} = 1 - \frac{4n_D}{n(n-1)} = 1 - \frac{(4)(0)}{5(5-1)} = 1 \]


The value τ̃ = 1 is identical to the value r_S = 1 computed for the same set of data. As noted in
Section I, when there is a perfect positive or negative correlation between the variables, identical
values are computed for τ̃ and r_S.


Endnotes


1. A discussion of monotonic relationships can be found in Section I of Spearman's rank- 
order correlation coefficient. 


2. The exception to this is that when the computed value of τ̃ is either +1 or -1, the
identical value will be computed for r_S.


3. The coefficient of determination is discussed in Section V of the Pearson product- 
moment correlation coefficient. 


4. a) Some sources employ the following statements as the null hypothesis and the
nondirectional alternative hypothesis for Kendall’s tau: Null hypothesis: H_0: Variables X
and Y are independent of one another; Nondirectional alternative hypothesis: H_1:
Variables X and Y are not independent of one another.

It is, in fact, true that if in the underlying population the two variables are
independent, the value of τ will equal zero. However, the fact that τ = 0, in and of itself,
does not ensure that the variables are independent of one another. Thus, it is conceivable
that in a population in which the correlation between X and Y is τ = 0, a nonmonotonic
curvilinear function can be employed to describe the relationship between the variables.

b) Note that in Example 30.1 the scores of subjects (who are the patients) on the X and
Y variables are the respective ranks assigned to the subjects/patients by Dr. X and Dr. Y.
Thus, the null hypothesis can also be stated as follows: In the underlying population the
sample of subjects/patients represents, the correlation between the rankings of Dr. X and
Dr. Y equals 0.


5. If either of the two values τ̃ or S is known, Equation 30.2 can be employed to compute
the other value. Some sources only list critical values for one of the two values τ̃ or S.


6. The following should be noted with respect to Equations 30.5 and 30.6: a) The 
denominator of Equation 30.5 is the standard deviation of the sampling distribution of the 
normal approximation of tau; and b) Based on a recommendation by Kendall (1970), 
Marascuilo and McSweeney (1977) (who employ Equation 30.6) describe the use of a 
correction for continuity for the normal approximation. In employing the correction for 
continuity with Equation 30.6, when S is a positive number, the value 1 is subtracted from 
S, and when S is a negative number, the value 1 is added to S. The correction for continuity 
(which is not employed by most sources) reduces the absolute value of z, thus resulting in 
a more conservative test. The rationale for employing a correction for continuity for a 
normal approximation of a sampling distribution is discussed in Section VI of the Wilcoxon
signed-ranks test.


7. Howell (1992, 1997) notes that the value τ̃ = .60 indicates that if a pair of subjects are
randomly selected, the likelihood that the pair will be ranked in the same order is .60 higher
than the likelihood that they will be ranked in the reverse order.


8. The data for Examples 30.1 and 29.2 are identical, except for the fact that in the latter 
example there is a tie for the X score in the second ordinal position which involves the X 
scores in the eighth and ninth rows. 


9. If Equation 30.1 is employed to compute the value of τ̃ for the data in Table 30.4, the
value τ̃ = .578 is computed. As noted in the text, because of the presence of ties,
n_C + n_D ≠ [n(n - 1)]/2.


\[ \tilde{\tau} = \frac{n_C - n_D}{\frac{n(n-1)}{2}} = \frac{34 - 8}{\frac{10(10-1)}{2}} = .578 \]


Test 31


Kendall’s Coefficient of Concordance 
(Nonparametric Measure of Association/Correlation 
Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Kendall’s coefficient of concordance is one of a number of measures of correlation or associ- 
ation discussed in this book. Measures of correlation are not inferential statistical tests, but are, 
instead, descriptive statistical measures that represent the degree of relationship between two or 
more variables. Upon computing a measure of correlation, it is common practice to employ one 
or more inferential statistical tests in order to evaluate one or more hypotheses concerning the 
correlation coefficient. The hypothesis stated below is the most commonly evaluated hypothesis 
for Kendall’s coefficient of concordance. 


Hypothesis evaluated with test In the underlying population represented by a sample, is the 
correlation between m sets of ranks some value other than zero? The latter hypothesis can also 
be stated in the following form: In the underlying population represented by the sample, are m 
sets of ranks independent of one another? 


Relevant background information on test Developed independently by Kendall and 
Babington-Smith (1939) and Wallis (1939), Kendall’s coefficient of concordance is a measure 
of correlation/association that is employed for three or more sets of ranks. Specifically, Ken- 
dall’s coefficient of concordance is a measure that allows a researcher to evaluate the degree 
of agreement between m sets of ranks for n subjects/objects. The population parameter estimated 
by the correlation coefficient will be represented by the notation W. The sample statistic 
computed to estimate the value of W will be represented by the notation W. The range of 
possible values within which Kendall’s coefficient of concordance may fall is 0 ≤ W ≤ +1.
When there is complete agreement among all m sets of ranks, the value of W will equal 1.¹
When, on the other hand, there is no pattern of agreement among the m sets of ranks, W will
equal 0. The value of W cannot be a negative number, since when there are more than two sets
of ranks it is not possible to have complete disagreement among all the sets. Because of this, it
becomes meaningless to use a negative correlation to describe the degree of association in the
data when m ≥ 3.

It is important to note that Kendall’s coefficient of concordance is related to both 
Spearman's rank-order correlation coefficient (Test 29) and Friedman's two-way analysis 
of variance by ranks (Test 25). Specifically: a) The computed value of W for m sets of ranks 
is linearly related to the average value of Spearman's rho that can be computed for all possible 
pairs of ranks. The relationship between Kendall’s coefficient of concordance and Spearman's 
rank-order correlation coefficient is discussed in greater detail in Section VII. It should be
noted that although Kendall’s coefficient of concordance can be computed for two sets of ranks,
in practice it is not. The latter can be attributed to the fact that in contrast to Spearman's rho and
Kendall’s tau (which are the measures of association that are employed with two sets of ranks),
the value of W cannot be a negative number (which in the case of Spearman's rho and
Kendall’s tau indicates the presence of an inverse relationship). Because the measures of association
that are employed with two sets of ranks can assume a negative value, W is not directly
comparable to them; and b) Although they were developed independently, Kendall’s
coefficient of concordance and Friedman's two-way analysis of variance by ranks are based
on the same mathematical model. Because of this, for a given set of data, the values computed
for χ²_r (which is the Friedman test statistic) and W can be algebraically derived from one
another. The relationship between Kendall’s coefficient of concordance and Friedman's
two-way analysis of variance by ranks is discussed in Section VII.
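
Both relationships can be illustrated numerically with a short Python sketch that uses the rankings of Example 31.1 (Table 31.1). The specific formulas used here, mean r_S = (mW - 1)/(m - 1) and χ²_r = m(n - 1)W, are the standard untied-rank results and are supplied as assumptions of this sketch; they are not quoted from this section, whose detailed treatment appears in Section VII. All names are illustrative.

```python
from itertools import combinations

# Rows are the m = 6 instructors, columns the n = 4 students (Table 31.1).
ranks = [
    [3, 2, 1, 4],
    [3, 2, 1, 4],
    [3, 2, 1, 4],
    [4, 2, 1, 3],
    [3, 2, 1, 4],
    [4, 1, 2, 3],
]
m, n = len(ranks), len(ranks[0])

# Kendall's W from the column sums of ranks (Equations 31.2 and 31.3).
column_sums = [sum(row[j] for row in ranks) for j in range(n)]
t_total = sum(column_sums)
u_total = sum(c * c for c in column_sums)
s = (n * u_total - t_total ** 2) / n
w = s / (m ** 2 * n * (n ** 2 - 1) / 12)

def spearman_rho(a, b):
    """Spearman's rho for two untied sets of ranks (Equation 29.1)."""
    d_squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Average rho over all m(m - 1)/2 pairs of instructors.
pair_rhos = [spearman_rho(a, b) for a, b in combinations(ranks, 2)]
mean_rho = sum(pair_rhos) / len(pair_rhos)
mean_rho_from_w = (m * w - 1) / (m - 1)       # assumed standard identity

# Friedman statistic computed directly and from W.
chi_square_r = 12 * u_total / (m * n * (n + 1)) - 3 * m * (n + 1)
chi_square_from_w = m * (n - 1) * w           # assumed standard identity

print(round(w, 3), round(mean_rho, 3), round(chi_square_r, 1))
```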


II. Example 


Example 31.1 Six instructors at an art institute rank four students with respect to artistic 
ability. A rank of 1 is assigned to the student with the highest level of ability and a rank of 4 to 
the student with the lowest level of ability. The rankings of the six instructors for the four 
students are summarized in Table 31.1. Is there a significant association between the rank- 
orders assigned to the four students by the six instructors? 


Table 31.1 Data for Example 31.1


                        Student
Instructor       1      2      3      4     Totals
    1            3      2      1      4
    2            3      2      1      4
    3            3      2      1      4
    4            4      2      1      3
    5            3      2      1      4
    6            4      1      2      3
  ΣR_j          20     11      7     22     T = 60
 (ΣR_j)²       400    121     49    484     U = 1054


III. Null versus Alternative Hypotheses 


Upon computing Kendall’s coefficient of concordance, it is common practice to determine 
whether the obtained value of the correlation coefficient is large enough to allow a researcher to 
conclude that the underlying population correlation coefficient between the m sets of ranks is 
some value other than zero. Section V describes how the latter hypothesis, which is stated below, 
can be evaluated through use of tables of critical W values or through use of an inferential sta- 
tistical test that is based on the chi-square distribution. 


Null hypothesis H_0: W = 0


(In the underlying population the sample represents, the correlation between the m = 6 sets of
ranks equals 0.)


Alternative hypothesis H_1: W ≠ 0


(In the underlying population the sample represents, the correlation between the m = 6 sets of
ranks equals some value other than 0. This is equivalent to stating that the m = 6 sets of ranks are
not independent of one another. When there are more than two sets of ranks, the alternative
hypothesis will always be stated nondirectionally.² In order to be significant, the obtained
value of W must be equal to or greater than the tabled critical value of W at the prespecified
level of significance.)


IV. Test Computations 


The data for Example 31.1 are summarized in Table 31.1. Note that in Table 31.1 there are 
m = 6 instructors, who are represented by the six rows, and n = 4 students who are represented 
by the four columns. 

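The summary values T = 60 and U = 1054 in Table 31.1 are computed as follows.


\[ T = \sum_{j=1}^{n} (\Sigma R_j) = \Sigma R_1 + \Sigma R_2 + \Sigma R_3 + \Sigma R_4 = 20 + 11 + 7 + 22 = 60 \]

\[ U = \sum_{j=1}^{n} (\Sigma R_j)^2 = (\Sigma R_1)^2 + (\Sigma R_2)^2 + (\Sigma R_3)^2 + (\Sigma R_4)^2 = (20)^2 + (11)^2 + (7)^2 + (22)^2 = 400 + 121 + 49 + 484 = 1054 \]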

The coefficient of concordance is a ratio of the variance of the sums of the ranks for the 
subjects (i.e., the variance of the XR; values) divided by the maximum possible value that can 
be computed for the variance of the sums of the ranks (for the relevant values of m and n). 
Equation 31.1 summarizes the definition of W. 


Variance of YR; values 


W = (Equation 31.1) 


Maximum possible variance for YR, 
values for relevant values of m and n 


The variance of the YR, values (which is represented by the notation S) is computed with 
Equation 31.2. 


nU - (TY 
n 


S = (Equation 31.2) 


Substituting the appropriate values from Example 31.1 in Equation 31.2, the value S = 154
is computed.


$$S = \frac{(4)(1054) - (60)^2}{4} = 154$$


W is computed with Equation 31.3. The denominator of Equation 31.3 (which for Example
31.1 equals 180) represents the maximum possible value that can be computed for the variance
of the sums of the ranks. The only time the value of S will equal the value of the denominator
of Equation 31.3 (thus resulting in the value W = 1) will be when there is perfect agreement
among the m judges with respect to their rankings of the n subjects.


$$W = \frac{S}{\frac{m^2 n(n^2 - 1)}{12}} \qquad \text{(Equation 31.3)}$$


Substituting the appropriate values in Equation 31.3, the value W = .856 is computed.


$$W = \frac{154}{\frac{(6)^2(4)[(4)^2 - 1]}{12}} = \frac{154}{180} = .856$$

Equation 31.4 is an alternative, computationally quicker equation for computing the value
of W. Equation 31.4, however, does not allow for the direct computation of S. The latter fact
is noted, since some of the tables employed to evaluate whether W is significant list critical
values for S rather than critical values for W.


$$W = \frac{12U - 3m^2 n(n + 1)^2}{m^2 n(n^2 - 1)} = \frac{(12)(1054) - (3)(6)^2(4)(4 + 1)^2}{(6)^2(4)[(4)^2 - 1]} = .856 \qquad \text{(Equation 31.4)}$$





The fact that the value of W is close to 1 indicates that there is a high degree of agreement
among the six instructors with respect to how they rank the four students. 
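
The computations in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `kendall_w_from_rank_sums` is hypothetical, and the input is the set of column rank sums (20, 11, 7, 22) quoted above for Example 31.1 rather than the full table of ranks.

```python
def kendall_w_from_rank_sums(rank_sums, m):
    """Return (T, U, S, W) from the n column rank sums for m sets of ranks."""
    n = len(rank_sums)
    T = sum(rank_sums)                          # sum of the rank sums
    U = sum(r ** 2 for r in rank_sums)          # sum of the squared rank sums
    S = (n * U - T ** 2) / n                    # Equation 31.2
    W = S / (m ** 2 * n * (n ** 2 - 1) / 12)    # Equation 31.3
    # Equation 31.4 should give the same value without computing S.
    W_alt = (12 * U - 3 * m ** 2 * n * (n + 1) ** 2) / (m ** 2 * n * (n ** 2 - 1))
    assert abs(W - W_alt) < 1e-12
    return T, U, S, W


T, U, S, W = kendall_w_from_rank_sums([20, 11, 7, 22], m=6)
print(T, U, S, round(W, 3))  # 60 1054 154.0 0.856
```

Returning S alongside W is a deliberate choice in this sketch, since some tables list critical values for S rather than for W.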


V. Interpretation of the Test Results 


The obtained value W = .856 is evaluated with Table A20 (Table of Critical Values for 
Kendall’s Coefficient of Concordance) in the Appendix. Note that Table A20 lists critical 
values for both W and S. The S values in Table A20 are extracted from Friedman (1940), 
and the values of W were computed by substituting the appropriate value of S in Equation 31.3. 
In order to reject the null hypothesis, the computed value of W (or S) must be equal to or greater 
than the tabled critical value at the prespecified level of significance. For m = 6 and n = 4, the
tabled critical .05 and .01 values for W (S) in Table A20 are $W_{.05} = .421$ ($S_{.05} = 75.7$) and $W_{.01} = .553$
($S_{.01} = 99.5$). Since the computed value W = .856 (S = 154) is greater than all of the
aforementioned critical values, the alternative hypothesis $H_1: W \neq 0$ is supported at both the
.05 and .01 levels.


Test 31a: Test of significance for Kendall's coefficient of concordance When exact tables 
for W (or S) are not available, the chi-square distribution provides a reasonably good approx- 
imation of the sampling distribution of W. The chi-square approximation of the sampling 
distribution of W is computed with Equation 31.5. The degrees of freedom employed for 
Equation 31.5 are df = n - 1.


$$\chi^2 = m(n - 1)W \qquad \text{(Equation 31.5)}$$


When the appropriate values from Example 31.1 are substituted in Equation 31.5, the value 
$\chi^2 = 15.41$ is computed.


$$\chi^2 = (6)(4 - 1)(.856) = 15.41$$


The value $\chi^2 = 15.41$ is evaluated with Table A4 (Table of the Chi-Square Distribution)
in the Appendix. In order to reject the null hypothesis, the obtained value of $\chi^2$ must be
equal to or greater than the tabled critical value at the prespecified level of significance. For
df = 4 - 1 = 3, the tabled critical values are $\chi^2_{.05} = 7.81$ and $\chi^2_{.01} = 11.34$ (which are the
chi-square values at the 95th and 99th percentiles). Since $\chi^2 = 15.41$ is greater than both of the
aforementioned critical values, the alternative hypothesis $H_1: W \neq 0$ is supported at both the
.05 and .01 levels.

For small sample sizes, the exact sampling distribution of the Friedman two-way analysis 
of variance by ranks (which, as noted in Section I, is mathematically equivalent to Kendall’s 
coefficient of concordance) can be employed to evaluate the significance of W. In addition, 
when the values of m and n are reasonably small, some sources (e.g., Marascuilo and McSweeney 
(1977) and Siegel and Castellan (1988)) evaluate the significance of W by employing an ad- 
justed chi-square value (discussed in Section VII of the Friedman two-way analysis of variance 
by ranks) which represents an exact value for the underlying sampling distribution. For m = 6
and n = 4, the adjusted/exact .05 and .01 critical values are $\chi^2_{.05} = 7.60$ and $\chi^2_{.01} = 10.00$
(which are reasonably close to the values $\chi^2_{.05} = 7.81$ and $\chi^2_{.01} = 11.34$). Since the computed
value $\chi^2 = 15.41$ is greater than both of the aforementioned critical values, the alternative
hypothesis $H_1: W \neq 0$ is supported at both the .05 and .01 levels. Thus, regardless of which tables
are employed to evaluate the results of Example 31.1, the alternative hypothesis $H_1: W \neq 0$ is
supported at both the .05 and .01 levels. Consequently, one can conclude there is a significant 
association among the six instructors with respect to how they rank the four students. 
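
The decision rule for Test 31a can be sketched as follows. This is a minimal illustration, not a general-purpose routine: the function name `chi_square_for_w` is hypothetical, and the critical values are simply the tabled and adjusted/exact values quoted in the text for df = 3, not values computed by the code.

```python
def chi_square_for_w(W, m, n):
    """Chi-square approximation of Equation 31.5; returns (chi_square, df)."""
    return m * (n - 1) * W, n - 1


chi2, df = chi_square_for_w(0.856, m=6, n=4)

# Critical values copied from the text for df = 3: (.05 level, .01 level).
critical_values = {
    "tabled chi-square": (7.81, 11.34),
    "adjusted/exact": (7.60, 10.00),
}

for label, (crit_05, crit_01) in critical_values.items():
    # The null hypothesis is rejected when the obtained value is equal to
    # or greater than the critical value.
    print(label, "reject at .05:", chi2 >= crit_05, "reject at .01:", chi2 >= crit_01)
```

With either set of critical values the obtained chi-square value exceeds both thresholds, which mirrors the conclusion stated above.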


VI. Additional Analytical Procedures for Kendall’s Coefficient of 
Concordance and/or Related Tests 


1. Tie correction for Kendall’s coefficient of concordance When ties are present in a set of 
data, some sources recommend that the value of W computed with Equations 31.3/31.4 be 
adjusted. Unless there is an excessive number of ties, the difference between the value of 
W computed with Equations 31.3/31.4 and the value computed with the tie correction will be
minimal. The tie correction, which results in a slight increase in the value of W, will be
illustrated with Example 31.2. 


Example 31.2 Four judges rank four contestants in a beauty contest. The judges are told to 
assign the most beautiful contestant a rank of 1 and the least beautiful contestant a rank of 4. 
The rank-orders of the four judges are summarized in Table 31.2. Is there a significant 
association between the rank-orders assigned to the four contestants by the four judges? 


Table 31.2  Data for Example 31.2

                             Contestant
Judge                  1         2         3         4       Totals
  1                    1         3         3         3
  2                    1         4         2         3
  3                    2         3         1         4
  4                    1.5       1.5       3.5       3.5
$\Sigma R_j$           5.5      11.5       9.5      13.5     T = 40
$(\Sigma R_j)^2$      30.25    132.25     90.25    182.25    U = 435


In Example 31.2, there are m = 4 sets of ranks/judges and n = 4 subjects/contestants who 
are ranked. Inspection of Table 31.2 reveals that Judges 1 and 4 employ tied ranks. As is the 
case with other rank-order tests described in the book, subjects who are tied for a specific rank 
are assigned the average of the ranks that are involved. Judge 1 assigns a rank of 1 to Contestant 
1, and places the other three contestants in a tie for the next ordinal position. Thus, Contestants 
2, 3, and 4 are all assigned a rank of 3, which is the average of the three ranks involved (i.e.,
(2 + 3 + 4)/3 = 3). Judge 4 places Contestants 1 and 2 in a tie for the first and second ordinal
positions, and Contestants 3 and 4 in a tie for the third and fourth ordinal positions. Thus, the 
contestants evaluated by Judge 4 are assigned ranks that are the average of those ranks for which 
they are tied (i.e., (1 + 2)/2 = 1.5 and (3 + 4)/2 = 3.5). 

Equation 31.6 (which is the tie-corrected version of Equation 31.4) is employed to compute 
the tie-corrected value of Kendall’s coefficient of concordance, which will be represented by 
the notation $W_c$.


$$W_c = \frac{12U - 3m^2 n(n + 1)^2}{m^2 n(n^2 - 1) - m \sum_{i=1}^{m} \left[ \sum_{a=1}^{s} (t_a^3 - t_a) \right]} \qquad \text{(Equation 31.6)}$$





The notation $\sum_{i=1}^{m} [\sum_{a=1}^{s} (t_a^3 - t_a)]$ in Equation 31.6 indicates the following: a) Within each
set of ranks, for each set of ties that is present the number of ties in the set is subtracted from the 
cube of the number of ties in that set; b) The sum of all the values computed in part a) is obtained 
for that set of ranks; and c) The sum of the values computed in part b) is computed for the m sets 
of ranks. 

In the case of Example 31.2, Judge 1 has s = 1 set of ties involving three contestants. Thus,
for Judge 1, $\sum_{a=1}^{s} (t_a^3 - t_a) = [(3)^3 - 3] = 24$. Since Judges 2 and 3 do not employ any ties, the
latter two judges will not contribute to the tie correction, and thus the value of $\sum_{a=1}^{s} (t_a^3 - t_a)$
will equal 0 for both of the aforementioned judges. Judge 4 has s = 2 sets of ties, each set
involving two contestants. Thus, for Judge 4, $\sum_{a=1}^{s} (t_a^3 - t_a) = [(2)^3 - 2] + [(2)^3 - 2] = 12$.

We can now determine the value $\sum_{i=1}^{m} [\sum_{a=1}^{s} (t_a^3 - t_a)] = 36$, which is employed in Equation 31.6.


$$\sum_{i=1}^{m} \left[ \sum_{a=1}^{s} (t_a^3 - t_a) \right] = 24 + 0 + 0 + 12 = 36$$





When the appropriate values are substituted in Equation 31.6, the tie-corrected value 
$W_c = .51$ is computed for Example 31.2.⁴


$$W_c = \frac{(12)(435) - (3)(4)^2(4)(4 + 1)^2}{(4)^2(4)[(4)^2 - 1] - (4)(36)} = .51$$





It can be seen below that when Equation 31.4 (which does not employ the tie correction)
is employed to compute W, the value W = .44 is obtained. Note that the latter value is less
than $W_c = .51$. The computed correlation $W_c = .51$ (as well as W = .44) indicates a moderate
degree of association between the four sets of ranks.


$$W = \frac{(12)(435) - (3)(4)^2(4)(4 + 1)^2}{(4)^2(4)[(4)^2 - 1]} = .44$$


Note that in Table A20 the tabled critical .05 and .01 values for m = 4 and n = 4 are
$W_{.05} = .619$ and $W_{.01} = .768$. Since both $W_c = .51$ and W = .44 are less than $W_{.05} = .619$,
the null hypothesis $H_0: W = 0$ cannot be rejected.

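
The tie correction of Equation 31.6 can be sketched in Python using the ranks in Table 31.2. The helper names `tie_term` and `kendall_w` are hypothetical, and the sketch assumes each judge's ranks are supplied as one row, with tied subjects already assigned the average of the ranks involved.

```python
from collections import Counter


def tie_term(rank_rows):
    """Sum of (t**3 - t) over every set of ties within every set of ranks."""
    total = 0
    for row in rank_rows:
        for t in Counter(row).values():
            total += t ** 3 - t  # an untied rank has t = 1 and contributes 0
    return total


def kendall_w(rank_rows, correct_ties=False):
    """Equation 31.4, or Equation 31.6 when correct_ties is True."""
    m, n = len(rank_rows), len(rank_rows[0])
    rank_sums = [sum(col) for col in zip(*rank_rows)]
    U = sum(r ** 2 for r in rank_sums)
    numerator = 12 * U - 3 * m ** 2 * n * (n + 1) ** 2
    denominator = m ** 2 * n * (n ** 2 - 1)
    if correct_ties:
        denominator -= m * tie_term(rank_rows)
    return numerator / denominator


table_31_2 = [
    [1, 3, 3, 3],          # Judge 1: one set of three tied contestants
    [1, 4, 2, 3],          # Judge 2: no ties
    [2, 3, 1, 4],          # Judge 3: no ties
    [1.5, 1.5, 3.5, 3.5],  # Judge 4: two sets of two tied contestants
]

print(tie_term(table_31_2))                               # 36
print(round(kendall_w(table_31_2), 2))                    # 0.44
print(round(kendall_w(table_31_2, correct_ties=True), 2)) # 0.51
```

Because the correction only shrinks the denominator, the tie-corrected value can never be smaller than the uncorrected one.
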
VII. Additional Discussion of Kendall’s Coefficient of Concordance


1. Relationship between Kendall’s coefficient of concordance and Spearman's rank-order 
correlation coefficient The relationship between Kendall’s coefficient of concordance and 
Spearman's rank-order correlation coefficient is as follows: If for data consisting of m sets 
of ranks a value for Spearman's rho is computed for every possible pair consisting of two sets 
of ranks (i.e., if m = 3, $r_{s_{1,2}}$, $r_{s_{1,3}}$, $r_{s_{2,3}}$), the average of all the $r_s$ values (to be designated $\bar{r}_s$) is
a linear function of the value of W computed for the data. Equation 31.7 defines the exact rela- 
tionship between Spearman's rho and W for the same set of data. 


$$\bar{r}_s = \frac{mW - 1}{m - 1} \qquad \text{(Equation 31.7)}$$


The above relationship will be demonstrated employing the data in Table 31.3 (which we 
will assume is a revised set of data for Example 31.1, in which m = 3 and n = 3). 


Table 31.3  Data for Use in Equation 31.7

                          Student
Instructor             1        2        3       Totals
    1                  3        1        2
    2                  1        2        3
    3                  3        2        1
$\Sigma R_j$           7        5        6       T = 18
$(\Sigma R_j)^2$      49       25       36       U = 110


Substituting the appropriate values in Equation 31.4, the value W = .111 is computed. The
latter value indicates a weak degree of association between the three sets of ranks.⁵


$$W = \frac{(12)(110) - (3)(3)^2(3)(3 + 1)^2}{(3)^2(3)[(3)^2 - 1]} = .111$$


Substituting W = .111 in Equation 31.7, the value $\bar{r}_s = -.333$ is computed.


$$\bar{r}_s = \frac{(3)(.111) - 1}{3 - 1} = -.333$$

We will now confirm that $\bar{r}_s = -.333$. Equation 29.1 is employed to compute the $r_s$
values for the 3 pairs of ranks (i.e., $r_{s_{1,2}}$ for the ranks of Instructor 1 versus Instructor 2; $r_{s_{1,3}}$ for
the ranks of Instructor 1 versus Instructor 3; $r_{s_{2,3}}$ for the ranks of Instructor 2 versus Instructor
3). The resulting values are $r_{s_{1,2}} = -.5$, $r_{s_{1,3}} = .5$, and $r_{s_{2,3}} = -1$. The average of the values of
the three pairs of ranks is $\bar{r}_s = [(-.5) + .5 + (-1)]/3 = -.333$, thus confirming the result
obtained with Equation 31.7. It should be noted that when Equation 31.7 is employed to
compute the value of $\bar{r}_s$, the range of values within which $\bar{r}_s$ can fall is defined by the following
limits: $[-1/(m - 1)] \leq \bar{r}_s \leq +1$. When m = 3, as is the case in the example under discussion,
the minimum possible value $\bar{r}_s$ can assume is -1/(3 - 1) = -.5. Note that even though the sign
of W cannot be negative, Equation 31.7 can convert a positive W value into either a positive or
negative $\bar{r}_s$ value.
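
The check described above can be reproduced with a short Python sketch using the ranks in Table 31.3. The function names are hypothetical, and the Spearman formula used here is the standard one for untied ranks, 1 - 6Σd²/[n(n² - 1)], which is assumed to correspond to Equation 29.1.

```python
from itertools import combinations


def spearman_rho_no_ties(x, y):
    """Spearman's rho for two untied sets of ranks: 1 - 6*sum(d^2)/(n(n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def kendall_w(rank_rows):
    """Kendall's W via Equation 31.4 (no tie correction)."""
    m, n = len(rank_rows), len(rank_rows[0])
    U = sum(sum(col) ** 2 for col in zip(*rank_rows))
    return (12 * U - 3 * m ** 2 * n * (n + 1) ** 2) / (m ** 2 * n * (n ** 2 - 1))


table_31_3 = [[3, 1, 2], [1, 2, 3], [3, 2, 1]]  # one row per instructor
m = len(table_31_3)

# Spearman's rho for every possible pair of instructors, then their average.
rhos = [spearman_rho_no_ties(a, b) for a, b in combinations(table_31_3, 2)]
mean_rho = sum(rhos) / len(rhos)

W = kendall_w(table_31_3)
from_eq_31_7 = (m * W - 1) / (m - 1)  # Equation 31.7

print(rhos)                                         # [-0.5, 0.5, -1.0]
print(round(mean_rho, 3), round(from_eq_31_7, 3))   # -0.333 -0.333
```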

The relationship described by Equation 31.7 can also be demonstrated for any of the examples
employed in illustrating Spearman's rank-order correlation coefficient, where m = 2.
To illustrate, in the case of Example 29.2 the value $r_s = .72$ is computed for two sets of ranks.
When the relevant values from Example 29.2 (which are summarized in Table 29.6)⁶ are substituted
in Equation 31.4, the value W = .86 is computed. Note that in Example 29.2, m = 2 and
n = 10.


$$W = \frac{(12)(1493.5) - (3)(2)^2(10)(10 + 1)^2}{(2)^2(10)[(10)^2 - 1]} = .86$$


Substituting W = .86 in Equation 31.7 yields the value $\bar{r}_s = .72$, which equals $r_s = .72$
computed with Equation 29.1.


$$\bar{r}_s = \frac{(2)(.86) - 1}{2 - 1} = .72$$


Thus, when m = 2, the value of $\bar{r}_s$ will equal $r_s$, since the average of a single value (based
on one pair of ranks) is that value.
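
As a worked special case (a sketch added for illustration, following directly from Equation 31.7), setting m = 2 shows that for a single pair of ranks the two coefficients are linked by a simple linear rescaling:

$$\bar{r}_s = r_s = \frac{2W - 1}{2 - 1} = 2W - 1 \quad \Longrightarrow \quad W = \frac{r_s + 1}{2}$$

Thus $r_s = 1$, $r_s = 0$, and $r_s = -1$ correspond to W = 1, W = .5, and W = 0, respectively.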


2. Relationship between Kendall’s coefficient of concordance and the Friedman two-way 
analysis of variance by ranks In Section I it is noted that Kendall’s coefficient of con- 
cordance and the Friedman two-way analysis of variance by ranks are based on the same 
mathematical model. Equation 31.8 defines the relationship between the computed value of 
W and $\chi^2_r$. The chi-square value ($\chi^2_r$) in Equation 31.8 can be employed to represent the
test statistic for the Friedman two-way analysis of variance by ranks (which is more 
commonly computed with Equation 25.1). Note that Equation 31.8 is identical to Equation 31.5. 


$$\chi^2_r = m(n - 1)W \qquad \text{(Equation 31.8)}$$


Equation 31.9, which is the algebraic transposition of Equation 31.8, provides an alternative 
way of computing the value W. 


$$W = \frac{\chi^2_r}{m(n - 1)} \qquad \text{(Equation 31.9)}$$


In order to employ Equation 31.9 to compute the value of W, it is necessary to evaluate the 
data for m sets of ranks on n subjects/objects with the Friedman two-way analysis of variance 
by ranks. To illustrate the equivalence of Kendall’s coefficient of concordance and the 
Friedman two-way analysis of variance by ranks, consider Example 31.3 which employs the 
same variables employed in Example 25.1 (which is used to illustrate the Friedman two-way 
analysis of variance by ranks). Note that in Example 31.3 there are m = 6 judges (who are 
represented by the six subjects) and n = 3 objects (which are represented by the three levels of 
noise). 


Example 31.3 Six subjects rank three levels of noise (based on the presence or absence of 
different types of music) with respect to the degree they believe each level of noise will disrupt 
one's ability to learn a list of nonsense syllables. The subjects are instructed to assign a rank 
of 1 to the most disruptive level of noise and a rank of 3 to the least disruptive level of noise. 
Table 31.4 summarizes the rankings of the subjects. Is there a significant association between 
the rank-orders assigned to the three levels of noise by the six subjects? 
Employing Equation 31.4, the value W = .92 is computed. The value W = .92 indicates
a strong degree of association between the six sets of ranks.


$$W = \frac{(12)(498.5) - (3)(6)^2(3)(3 + 1)^2}{(6)^2(3)[(3)^2 - 1]} = .92$$


Table 31.4  Data for Example 31.3

                               Type of noise
                                   Classical     Rock
Subject               No noise       music       music      Totals
   1                      3            2           1
   2                      3            2           1
   3                      3            2           1
   4                      3            2           1
   5                      3            2           1
   6                      3            1.5         1.5
$\Sigma R_j$             18           11.5         6.5      T = 36
$(\Sigma R_j)^2$        324          132.25       42.25     U = 498.5


It happens to be the case that the configuration of ranks in Example 31.3 is identical to the 
configuration of ranks employed in Example 25.1. When the Friedman two-way analysis of 
variance by ranks is employed to evaluate the same six sets of ranks, the value $\chi^2_r = 11.08$
is computed. The reader should take note of the fact that when the data are evaluated with
Equation 25.1 in Section IV of the Friedman test, k is employed to represent the number of
levels of the independent variable and n is employed to represent the number of subjects. In
Table 25.1, the three columns of $R_j$ values represent the k = 3 levels of the independent variable,
and the six rows represent the n = 6 subjects. In the model employed for Kendall’s coefficient
of concordance, the value of n corresponds to the value employed for k in the Friedman model,
and thus, n = k = 3. The value of m in the Kendall model corresponds to the value employed
for n in the Friedman model, and thus, m = n = 6. The equations used in this section employ 
notation that is consistent with the Kendall model. 

When the value $\chi^2_r = 11.08$ is substituted in Equation 31.9, the value W = .92 is
computed.


$$W = \frac{11.08}{(6)(3 - 1)} = .92$$


In the same respect, if W = .92 is substituted in Equations 31.8/31.5, it yields the value
$\chi^2_r = 11.08$.⁸


$$\chi^2_r = (6)(3 - 1)(.92) = 11.08$$


Since the value of W can be computed for the Friedman test model, Kendall’s coefficient 
of concordance can be employed as a measure of effect size for a within-subjects design (in- 
volving data that are rank-ordered) with an independent variable that has three or more levels. 
The closer the value of W is to 1, the stronger the relationship between the independent and 
dependent variables. Consequently, the value W = .92 computed for Example 25.1 (as well 
as Example 31.3), indicates there is a strong degree of association between the independent
variable (noise) and the dependent variable (the rank-ordering on number of nonsense syllables
recalled/disruptive potential of noise). 
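
The equivalence expressed by Equations 31.8 and 31.9 can be checked numerically with the ranks in Table 31.4. In this sketch the function names are hypothetical, and the Friedman statistic is computed with the standard formula 12/[Nk(k + 1)] Σ(ΣR_j)² - 3N(k + 1) (N subjects, k conditions), which is assumed to be what Equation 25.1 denotes.

```python
def rank_sums(rank_rows):
    """Column sums of the ranks (one row per judge/subject)."""
    return [sum(col) for col in zip(*rank_rows)]


def kendall_w(rank_rows):
    """Kendall's W via Equation 31.4 (no tie correction)."""
    m, n = len(rank_rows), len(rank_rows[0])
    U = sum(r ** 2 for r in rank_sums(rank_rows))
    return (12 * U - 3 * m ** 2 * n * (n + 1) ** 2) / (m ** 2 * n * (n ** 2 - 1))


def friedman_chi_square(rank_rows):
    """Standard Friedman statistic, in Friedman notation (N subjects, k levels)."""
    N, k = len(rank_rows), len(rank_rows[0])
    U = sum(r ** 2 for r in rank_sums(rank_rows))
    return 12 / (N * k * (k + 1)) * U - 3 * N * (k + 1)


# Table 31.4: columns are no noise, classical music, rock music.
table_31_4 = [[3, 2, 1]] * 5 + [[3, 1.5, 1.5]]
m, n = len(table_31_4), len(table_31_4[0])

W = kendall_w(table_31_4)
chi2_friedman = friedman_chi_square(table_31_4)
chi2_from_w = m * (n - 1) * W             # Equation 31.8
w_from_chi2 = chi2_friedman / (m * (n - 1))  # Equation 31.9

print(round(W, 4), round(chi2_friedman, 2))  # 0.9236 11.08
```

Carrying W to four decimal places, as the sketch does, reproduces the chi-square value of 11.08 exactly.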


3. Weighted rank/top-down concordance Section VII of Spearman's rank-order correla- 
tion coefficient briefly discusses a method referred to as weighted/top-down correlation that 
can be employed for differentially weighting the most extreme scores in a set of data. Zar (1999, 
pp. 449-450) describes the extension of this method to Kendall’s coefficient of concordance.
Thus, a weighted/top-down correlation coefficient can be employed if a researcher's primary 
concern is with the degree of agreement for objects/subjects that are ranked the highest by a set 
of judges. In such a case the correlation coefficient would minimally weight scores of a lower 
rank. 


VIII. Additional Examples Illustrating the Use of Kendall’s 
Coefficient of Concordance 


Examples 31.4 and 31.5 are two additional examples that can be evaluated with Kendall's 
coefficient of concordance. Example 31.4 addresses the same question evaluated by Example 
29.2, but in Example 31.4 the values m = 6 and n = 4 are employed in place of the values m = 2 
and n = 10 employed in Example 29.2. Since Examples 31.4 and 31.5 employ the same data as 
Example 31.1, they yield the same result. 


Example 31.4 In order to determine whether critics agree with one another in their evaluation
of movies, a newspaper editor asks six critics to rank four movies (assigning a rank of 1 to the
best movie, a rank of 2 to the next best movie, etc.). Table 31.5 summarizes the data for the
study. Is there a significant association between the six sets of ranks?


Table 31.5  Data for Example 31.4

                             Movie
Critic                 1        2        3        4       Totals
   1                   3        2        1        4
   2                   3        2        1        4
   3                   3        2        1        4
   4                   4        2        1        3
   5                   3        2        1        4
   6                   4        1        2        3
$\Sigma R_j$          20       11        7       22       T = 60
$(\Sigma R_j)^2$     400      121       49      484       U = 1054


Example 31.5 Four members of a track team are ranked by the head coach with respect to their 
ability on six track and field events. For each event, the coach assigns a rank of 1 to the athlete 
who is best at the event and a rank of 4 to the athlete who is worst at the event. Table 31.6 
summarizes the data for the study. Is there a significant association between the rank-orders 
assigned to the athletes on the six events? 


Note that in Example 31.5, even though one judge (the coach) is employed, the judge 
generates six sets of ranks (i.e., six sets of judgements). If there is a significant association 
between the six sets of ranks/judgements, it indicates that the athletes are perceived to be con- 
sistent with respect to performance on the six events. 

Table 31.6  Data for Example 31.5

                             Athlete
Event                  1        2        3        4       Totals
Sprint                 3        2        1        4
1500 meters            3        2        1        4
Pole vault             3        2        1        4
Long jump              4        2        1        3
Shot put               3        2        1        4
400 meters             4        1        2        3
$\Sigma R_j$          20       11        7       22       T = 60
$(\Sigma R_j)^2$     400      121       49      484       U = 1054

Endnotes


1. Siegel and Castellan (1988) emphasize the fact that a correlation equal to or close to 1 does 
not in itself indicate that the rankings are correct. A high correlation only indicates that there 
is agreement among the m sets of ranks. It is entirely possible that there can be complete 
agreement among two or more sets of ranks, but that all of the rankings are, in fact, incorrect. 
In other words, the ranks may not reflect what is actually true with regard to the subjects/ 
objects that are evaluated. 


2. In point of fact, if the values of $r_s$ and W are computed for m = 2 sets of ranks, when the
computed values for $r_s$ are, respectively, 1, -1, and 0 the computed values of W will,
respectively, be 1, 0, and .5. The latter sets of values can be obtained through use of
Equation 31.7, which is presented in Section VII. 


3. Some sources state that the alternative hypothesis is directional, since W can only be a 
positive value. Related to this is the fact that only the upper tail of the chi-square distribution 
(which is discussed in Section V) is employed in approximating the exact sampling distri- 
bution of W. In the final analysis, it becomes academic whether one elects to identify the 
alternative hypothesis as directional or nondirectional. 


4. The tie-corrected version of Equation 31.3 is noted below:


$$W_c = \frac{S}{\frac{m^2 n(n^2 - 1) - m \sum_{i=1}^{m} \left[ \sum_{a=1}^{s} (t_a^3 - t_a) \right]}{12}} = \frac{35}{\frac{(4)^2(4)[(4)^2 - 1] - (4)(36)}{12}} = .51$$


5. Note that for m = 3 and n = 3, no tabled critical values are listed in Table A20. This is the 
case, since critical values cannot be computed for values of m and n that fall below specific 
minimum values. If Equation 31.5 is employed to evaluate W = .111, it yields the following
result: $\chi^2 = (3)(3 - 1)(.111) = .666$. Since $\chi^2 = .666$ is less than the tabled critical two-
tailed value (for df = 2) $\chi^2_{.05} = 5.99$, the obtained value W = .111 is not significant. In
point of fact, even if the maximum possible value W = 1 is substituted in Equation 31.5,
it yields the value $\chi^2 = 6$, which is barely above $\chi^2_{.05} = 5.99$. Since the chi-square distribution
provides an approximation of the exact sampling distribution, in this instance it
would appear that the tabled value $\chi^2_{.05} = 5.99$ is a little too high and, in actuality, is
associated with a Type I error rate that is slightly above .05.


6. The summary of the data for Example 29.2 in Table 29.6 provides the necessary values 
required to compute the value of W. The latter values are not computed in Table 29.5,
which (employing a different format) also summarizes the data for Example 29.2.


7. Although there is one set of ties in the data, the tie correction described in Section VI is not 
employed for Example 31.3. 


8. The exact value $\chi^2_r = 11.08$ is computed if the value W = .9236 (which carries the computation
of W to four decimal places) is employed in Equations 31.8/31.5.

Test 32


Goodman and Kruskal’s Gamma 
(Nonparametric Measure of Association/Correlation 
Employed with Ordinal Data) 


I. Hypothesis Evaluated with Test and Relevant Background 
Information 


Goodman and Kruskal's gamma is one of a number of measures of correlation or association 
discussed in this book. Measures of correlation are not inferential statistical tests, but are, 
instead, descriptive statistical measures that represent the degree of relationship between two or 
more variables. Upon computing a measure of correlation, it is common practice to employ one 
or more inferential statistical tests in order to evaluate one or more hypotheses concerning the 
correlation coefficient. The hypothesis stated below is the most commonly evaluated hypothesis 
for Goodman and Kruskal's gamma. 


Hypothesis evaluated with test In the underlying population represented by a sample, is the 
correlation between subjects' scores on two variables some value other than zero? 


Relevant background information on test Prior to reading the material in this section the 
reader should review the general discussion of correlation in Section I of the Pearson product- 
moment correlation coefficient (Test 28), as well as the material in Section I of Kendall's tau 
(Test 30). Developed by Goodman and Kruskal (1954, 1959, 1963, 1972), gamma is a bivariate 
measure of correlation/association that is employed with rank-order data which is summarized 
within the format of an ordered contingency table. The population parameter estimated by the 
correlation coefficient will be represented by the notation $\gamma$ (which is the lower case Greek letter
gamma). The sample statistic computed to estimate the value of $\gamma$ will be represented by the
notation G. As is the case with Spearman's rank-order correlation coefficient (Test 29) and 
Kendall’s tau, Goodman and Kruskal’s gamma can be employed to evaluate data in which 
a researcher has scores for n subjects/objects on two variables (designated as the X and Y 
variables), both of which have been rank-ordered. However, in contrast to Spearman's rho and 
Kendall’s tau, computation of gamma is recommended when there are many ties in a set of data, 
and thus it becomes more efficient to summarize the data within the format of an ordered r x c 
contingency table. 

An ordered r x c contingency table consists of r x c cells, and is comprised of r rows and 
c columns.¹ In the model employed for Goodman and Kruskal’s gamma, each of the rows in
the contingency table represents one of the r levels of the X variable, and each of the columns 
represents one of the c levels of the Y variable (or vice versa). Since the contingency table that 
is employed to summarize the data is ordered, the categories for both the row and the column 
variables are arranged sequentially with respect to magnitude/ordinal position. To be more 
specific, the first row in the table represents the category that is lowest in magnitude on the X 
variable and the $r^{th}$ row represents the category that is highest in magnitude on the X variable.
In the same respect, the first column represents the category that is lowest in magnitude on the Y
variable and the $c^{th}$ column represents the category that is highest in magnitude on the Y variable.²
Recorded within each of the r x c cells of the contingency table are the number of subjects
whose categorization on the X and Y variables corresponds to the row and column of a specific 
cell. 

The value of gamma computed for a set of data represents the difference p(C) - p(D), 
where: a) p(C) is the probability that the ordering of the scores on the row and column variables 
for a pair of subjects is concordant (i.e., in agreement); and b) p(D) is the probability that the 
ordering of the scores on the row and column variables for a pair of subjects is discordant (i.e., 
in disagreement).

To illustrate, if a subject is categorized on the lowest level of the row variable and the 
highest level of the column variable, that subject is concordant with respect to ordering when 
compared with any other subject who is assigned to a lower category on the row variable than 
he is on the column variable. On the other hand, that subject is discordant with respect to 
ordering when compared with another subject who is assigned to a higher category on the row 
variable than he is on the column variable. For a more thorough discussion of the concepts of 
concordance and discordance the reader should review Section I of Kendall’s tau. 

The range of possible values within which a computed value of gamma may fall is
$-1 \leq G \leq +1$. As is the case for Kendall’s tau, a positive value of G indicates that the number of
concordant pairs in a set of data is greater than the number of discordant pairs, while a negative 
value indicates that the number of discordant pairs is greater than the number of concordant pairs. 
The computed value of G will equal 1 when the ordering of scores for all of the pairs of subjects 
in a set of data is concordant, and will equal -1 when the ordering of scores for all of the pairs
of subjects is discordant. When G = 0, the number of concordant and discordant pairs of sub- 
jects in a set of data is equal. 

Since Goodman and Kruskal’s gamma and Kendall’s tau both involve evaluating pairs 
of scores with respect to concordance versus discordance, the two measures of association are 
related to one another. Marascuilo and McSweeney (1977), who provide a detailed discussion 
on the nature of the relationship between gamma and tau, note that if tau and G are computed
for the same set of data, as the number of pairs of ties increases, the absolute value computed for
G will become increasingly larger relative to the absolute value of tau. As a result of the latter,
researchers who want to safeguard against obtaining an inflated value for the degree of
association between the two variables may prefer to compute tau for a set of rank-order data, as
opposed to computing the value of G. 

It should be noted that Yule's Q (Test 16i) (which is one of a number of measures of 
association that can only be employed to evaluate a 2 x 2 contingency table) represents a special 
case of Goodman and Kruskal's gamma. Although gamma can be employed with a 2 x 2 
contingency table, it is typically employed with ordered contingency tables in which there are 
at least three levels on either the row or column variable. A more detailed discussion of the rela- 
tionship between Yule's Q and Goodman and Kruskal's gamma can be found in Section VII. 


II. Example 


Example 32.1 A researcher wants to determine whether or not a relationship exists between 
a person's weight (which will be designated as the X variable) and birth order (which will be 
designated as the Y variable). Upon determining the weight and birth order of 300 subjects, each 
subject is categorized with respect to one of three weight categories and one of four birth order 
categories. Specifically, the following three categories are employed with respect to weight: 
below average, average, above average. The following four categories are employed with 
respect to birth order: first born, second born, third born, fourth born and all subsequent
birth orders. Table 32.1 (which is a 3 x 4 ordered contingency table, with r = 3 and c = 4)
summarizes the data. Do the data indicate there is a significant association between a person's 
weight and birth order? 


Table 32.1  Summary of Data for Example 32.1

                                          Birth order
                            1st born   2nd born   3rd born   4th born+   Row sums
          Below average                                                     100
Weight    Average                                                           100
          Above average                                                     100
          Column sums          90         90         65         55          300


III. Null versus Alternative Hypotheses 


Upon computing Goodman and Kruskal’s gamma, it is common practice to determine whether 
the obtained absolute value of the correlation coefficient is large enough to allow a researcher 
to conclude that the underlying population correlation coefficient between the two variables is 
some value other than zero. Section V describes how the latter hypothesis, which is stated below, 
can be evaluated through use of an inferential statistical test that is based on the normal 
distribution. 


Null hypothesis $H_0: \gamma = 0$


(In the underlying population the sample represents, the correlation between the scores/ 
categorization of subjects on Variable X and Variable Y equals 0.) 


Alternative hypothesis $H_1: \gamma \neq 0$


(In the underlying population the sample represents, the correlation between the scores/ 
categorization of subjects on Variable X and Variable Y equals some value other than 0. This is 
a nondirectional alternative hypothesis, and it is evaluated with a two-tailed test. Either a 
significant positive G value or a significant negative G value will provide support for this alter- 
native hypothesis. In order to be significant, the obtained absolute value of G must be equal to 
or greater than the tabled critical two-tailed G value at the prespecified level of significance.) 


or

$H_1: \gamma > 0$

(In the underlying population the sample represents, the correlation between the scores/
categorization of subjects on Variable X and Variable Y equals some value greater than 0. This
is a directional alternative hypothesis, and it is evaluated with a one-tailed test. Only a
significant positive G value will provide support for this alternative hypothesis. In order to be
significant (in addition to the requirement of a positive G value), the obtained absolute value of
G must be equal to or greater than the tabled critical one-tailed G value at the prespecified level
of significance.)


or
                         H_1: γ < 0


(In the underlying population the sample represents, the correlation between the scores/
categorization of subjects on Variable X and Variable Y equals some value less than 0. This is
a directional alternative hypothesis, and it is evaluated with a one-tailed test. Only a signif- 
icant negative G value will provide support for this alternative hypothesis. In order to be 
significant (in addition to the requirement of a negative G value), the obtained absolute value of 
G must be equal to or greater than the tabled critical one-tailed G value at the prespecified level 
of significance.) 


Note: Only one of the above noted alternative hypotheses is employed. If the alternative 
hypothesis the researcher selects is supported, the null hypothesis is rejected.


IV. Test Computations 


In order to compute the value of G, it is necessary to determine the number of pairs of subjects
who are concordant with respect to the ordering of their scores on the X and Y variables (which
will be represented by the notation n_C), and the number of pairs of subjects who are discordant
with respect to the ordering of their scores on the X and Y variables (which will be represented
by the notation n_D). Upon computing the values of n_C and n_D, Equation 32.1 is employed to
compute the value of Goodman and Kruskal's gamma.


$$G = \frac{n_C - n_D}{n_C + n_D} \qquad \text{(Equation 32.1)}$$


The determination of the values of n_C and n_D is based on an analysis of the frequencies
in the ordered contingency table (i.e., Table 32.1). Each of the cells in the table will be identified
by two digits. The first digit will represent the row within which the cell falls, and the second
digit will represent the column within which the cell falls. Thus, Cell_ij is the cell in the i-th row
and j-th column. As an example, since it is in both the first row and first column, the cell in the
upper left hand corner of Table 32.1 is Cell_11. The number of subjects within each cell is
identified by the notation n_ij. Thus, in the case of Cell_11, n_11 = 70.

The protocol for determining the values of n_C and n_D will now be described. The
following procedure is employed to determine the value of n_C.

a) Begin with the cell in the upper left hand corner of the table (i.e., Cell_11). Determine
the frequency of that cell (which will be referred to as the target cell), and multiply the frequency
by the sum of the frequencies of all other cells in the table that fall both below it and to the
right of it. In Table 32.1, the following six cells meet the criteria of being both below and to the
right of Cell_11: Cell_22, Cell_23, Cell_24, Cell_32, Cell_33, Cell_34. Note that although Cell_21 and
Cell_31 are below Cell_11, they are not to the right of it, and although Cell_12, Cell_13, and Cell_14
fall to the right of Cell_11, they do not fall below it. Any subject who falls within a cell that is both
below and to the right of Cell_11 will form a concordant pair with any subject in Cell_11. The
rationale for this is as follows: Assume that the values R_Xi and R_Yi represent the score/ranking/
category of Subject i on the X and Y variables, and that R_Xj and R_Yj represent the score/ranking/
category of Subject j on the X and Y variables. Assume that Subject i is a subject in the target cell,
and that Subject j is a subject in a cell that falls below and to the right of the target cell. We can
state that the sign of the difference (R_Xi - R_Xj) will be the same as the sign of the difference
(R_Yi - R_Yj) when the scores of any subject in the target cell are compared with any subject in a
cell that falls below and to the right of the target cell. When for any pair of subjects the signs of
the differences (R_Xi - R_Xj) and (R_Yi - R_Yj) are identical, that pair of subjects is concordant with
respect to their ordering on the two variables.
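
The sign rule can be sketched in a few lines of Python. This is only an illustration: the function name `classify_pair` and the representation of each subject as a (rank on X, rank on Y) tuple are assumptions introduced here, with weight taken as X and birth order as Y.

```python
def classify_pair(subject_i, subject_j):
    """Classify a pair of (rank_x, rank_y) tuples by the signs of their rank differences."""
    dx = subject_i[0] - subject_j[0]  # R_Xi - R_Xj
    dy = subject_i[1] - subject_j[1]  # R_Yi - R_Yj
    if dx == 0 or dy == 0:
        return "tied"        # a zero difference is neither concordant nor discordant
    if (dx > 0) == (dy > 0):
        return "concordant"  # both differences have the same sign
    return "discordant"      # the differences have opposite signs


# Ranks (1, 1) versus (2, 2): both differences equal -1
print(classify_pair((1, 1), (2, 2)))  # concordant
# Ranks (1, 4) versus (2, 3): the differences are -1 and +1
print(classify_pair((1, 4), (2, 3)))  # discordant
```

Pairs that share a row or a column are labeled tied and contribute to neither count.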

To illustrate, each of the 70 subjects in Cell_11 has a rank of 1 on both of the variables, and
each of the 60 subjects in Cell_22 (which is one of the cells below and to the right of Cell_11) has
a rank of 2 on both of the variables. Any pair of subjects that is formed by employing one
subject from Cell_11 and one subject from Cell_22 will be concordant with respect to their ordering
on the two variables, since for each pair the sign of the difference between the ranks on both
variables will be negative (i.e., (R_Xi - R_Xj) = (1 - 2) = -1 and (R_Yi - R_Yj) = (1 - 2) = -1).
If, on the other hand, we compare the ranks on both variables for any subject who is in the target
cell with the ranks on both variables for any subject who is in a cell that is not below and to the
right of the target cell, (R_Xi - R_Xj) and (R_Yi - R_Yj) will have different signs or will equal zero.

The expression which summarizes the product of the frequency of Cell_11 and the sum of
the frequencies of all the cells that fall both below and to the right of it is as follows:
n_11(n_22 + n_23 + n_24 + n_32 + n_33 + n_34). Substituting the appropriate frequencies from Table
32.1, we obtain 70(60 + 20 + 10 + 15 + 35 + 40) = (70)(180) = 12600. This latter value will be
designated as Product 1.

b) The same procedure employed with Cell_11 is applied to all remaining cells. Moving
to the right in Row 1, the procedure is next employed with Cell_12. Product 2, which represents
the product for the second target cell, can be summarized by the expression
n_12(n_23 + n_24 + n_33 + n_34), since Cell_23, Cell_24, Cell_33, and Cell_34 are the only cells that fall
both below and to the right of Cell_12. Thus, Product 2 will equal 15(20 + 10 + 35 + 40)
= (15)(105) = 1575.

c) Upon computing Product 2, products for the two remaining cells in Row 1 are
computed, after which products are computed for each of the cells in Rows 2 and 3. The
computation of the products for all 12 cells in the ordered contingency table is summarized in
Table 32.2. Note that since many of the cells have no cell that falls both below and to the right
of them, the value that the frequency of these cells will be multiplied by will equal zero, and thus
the resulting product will equal zero. The value of n_C is the sum of all the products in Table
32.2. For Example 32.1, n_C = 20875.


Table 32.2 Computation of n_C for Example 32.1


Cell_11:   70 (60 + 20 + 10 + 15 + 35 + 40)   =  12600   Product 1
Cell_12:   15 (20 + 10 + 35 + 40)             =   1575   Product 2
Cell_13:   10 (10 + 40)                       =    500   Product 3
Cell_14:    5 (0)                             =      0   Product 4
Cell_21:   10 (15 + 35 + 40)                  =    900   Product 5
Cell_22:   60 (35 + 40)                       =   4500   Product 6
Cell_23:   20 (40)                            =    800   Product 7
Cell_24:   10 (0)                             =      0   Product 8
Cell_31:   10 (0)                             =      0   Product 9
Cell_32:   15 (0)                             =      0   Product 10
Cell_33:   35 (0)                             =      0   Product 11
Cell_34:   40 (0)                             =      0   Product 12

n_C = Sum of products = 20875
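
The target-cell procedure for concordant pairs can be sketched as follows. The cell frequencies are those that appear in the products listed for Example 32.1; the function name `concordant_products` and the list-of-rows layout are illustrative assumptions.

```python
# Cell frequencies for Example 32.1 (rows: weight category, columns: birth order)
TABLE_32_1 = [
    [70, 15, 10, 5],
    [10, 60, 20, 10],
    [10, 15, 35, 40],
]


def concordant_products(table):
    """Return, for each target cell in row order, frequency x (sum of cells below and to the right)."""
    n_rows, n_cols = len(table), len(table[0])
    products = []
    for i in range(n_rows):
        for j in range(n_cols):
            below_right = sum(
                table[k][m]
                for k in range(i + 1, n_rows)
                for m in range(j + 1, n_cols)
            )
            products.append(table[i][j] * below_right)
    return products


products_c = concordant_products(TABLE_32_1)
n_c = sum(products_c)
print(products_c[:3], n_c)  # [12600, 1575, 500] 20875
```

Cells in the last row or last column have no cell below and to the right of them, so their products are zero.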


Upon computing the value of n_C, the following protocol is employed to compute the value
of n_D.

a) Begin with the cell in the upper right hand corner of the table (i.e., Cell_14). Determine
the frequency of that cell, and multiply the frequency by the sum of the frequencies of all other
cells in the table that fall both below it and to the left of it. In Table 32.1, the following six
cells meet the criteria of being both below and to the left of Cell_14: Cell_21, Cell_22, Cell_23, Cell_31,
Cell_32, Cell_33. Note that although Cell_24 and Cell_34 are below Cell_14, they are not to the left of it,
and although Cell_11, Cell_12, and Cell_13 fall to the left of Cell_14, they do not fall below it. Any
subject who falls within a cell that is both below and to the left of Cell_14 will form a discordant
pair with any subject in Cell_14. The general rule that can be stated with respect to discordant pairs
is as follows (if we assume that Subject i is a subject in the target cell, and Subject j is a subject
in some other cell): The sign of the difference (R_Xi - R_Xj) will be different than the sign of the
difference (R_Yi - R_Yj) when the scores of any subject i in the target cell are compared with any
subject in a cell that falls below and to the left of the target cell. When for any pair of subjects
the signs of the differences (R_Xi - R_Xj) and (R_Yi - R_Yj) are different, that pair of subjects is
discordant with respect to their ordering on the two variables.

To illustrate, each of the 5 subjects in Cell_14 has a rank of 1 on weight and a rank of 4 on
birth order. Each of the 20 subjects in Cell_23 (which is one of the cells below and to the left of
Cell_14) has a rank of 2 on weight and a rank of 3 on birth order. Any pair of subjects that is
formed by employing one subject from Cell_14 and one subject from Cell_23 will be discordant with
respect to the ordering of the ranks of the subjects on the two variables, since for each pair the
signs of the difference between the ranks on both variables will be different (i.e., (R_Xi - R_Xj)
= (1 - 2) = -1 and (R_Yi - R_Yj) = (4 - 3) = +1). If, on the other hand, we compare the ranks
on both variables for any subject who is in the target cell with the ranks on both variables for any
subject who is in a cell that is not below and to the left of the target cell, (R_Xi - R_Xj) and
(R_Yi - R_Yj) will have the same sign or will equal zero.


The expression which summarizes the product of the frequency of Cell_14 and the sum of
the frequencies of all cells that fall both below it and to the left of it is as follows:
n_14(n_21 + n_22 + n_23 + n_31 + n_32 + n_33). Substituting the appropriate frequencies, we obtain
5(10 + 60 + 20 + 10 + 15 + 35) = (5)(150) = 750. As is the case in determining the number of
concordant pairs, we will designate the product for the first cell that is analyzed as Product 1.

b) The same procedure employed with Cell_14 is applied to all remaining cells. Moving
to the left in Row 1, the procedure is next employed with Cell_13. Product 2, which represents
the product for the second target cell, can be summarized by the expression
n_13(n_21 + n_22 + n_31 + n_32), since Cell_21, Cell_22, Cell_31, and Cell_32 are the only cells that fall
both below and to the left of Cell_13. Thus, Product 2 will equal 10(10 + 60 + 10 + 15) = (10)(95)
= 950.

c) Upon computing Product 2, products for the two remaining cells in Row 1 are
computed, after which products are computed for each of the cells in Rows 2 and 3. The
computation of the products for all 12 cells in the ordered contingency table is summarized in
Table 32.3. Note that since many of the cells have no cell that falls both below and to the left of
them, the value that the frequency of such cells will be multiplied by will equal zero, and thus
the resulting product will equal zero. The value of n_D will be the sum of all the products in
Table 32.3. For Example 32.1, n_D = 3700.


Table 32.3 Computation of n_D for Example 32.1


Cell_14:    5 (10 + 60 + 20 + 10 + 15 + 35)   =    750   Product 1
Cell_13:   10 (10 + 60 + 10 + 15)             =    950   Product 2
Cell_12:   15 (10 + 10)                       =    300   Product 3
Cell_11:   70 (0)                             =      0   Product 4
Cell_24:   10 (10 + 15 + 35)                  =    600   Product 5
Cell_23:   20 (10 + 15)                       =    500   Product 6
Cell_22:   60 (10)                            =    600   Product 7
Cell_21:   10 (0)                             =      0   Product 8
Cell_34:   40 (0)                             =      0   Product 9
Cell_33:   35 (0)                             =      0   Product 10
Cell_32:   15 (0)                             =      0   Product 11
Cell_31:   10 (0)                             =      0   Product 12

n_D = Sum of products = 3700
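
The discordant count follows the same pattern, except that each target cell is multiplied by the cells below and to the left of it. In the sketch below the cells are visited from right to left within each row, which is the order used in the text; the function name `discordant_products` is an illustrative assumption.

```python
# Cell frequencies for Example 32.1 (rows: weight category, columns: birth order)
TABLE_32_1 = [
    [70, 15, 10, 5],
    [10, 60, 20, 10],
    [10, 15, 35, 40],
]


def discordant_products(table):
    """Return frequency x (sum of cells below and to the left), scanning each row right to left."""
    n_rows, n_cols = len(table), len(table[0])
    products = []
    for i in range(n_rows):
        for j in reversed(range(n_cols)):
            below_left = sum(
                table[k][m]
                for k in range(i + 1, n_rows)
                for m in range(j)
            )
            products.append(table[i][j] * below_left)
    return products


products_d = discordant_products(TABLE_32_1)
n_d = sum(products_d)
print(products_d[:3], n_d)  # [750, 950, 300] 3700
```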

Substituting the values n_C = 20875 and n_D = 3700 in Equation 32.1, the value G = .70
is computed. Note that the value of G is positive, since the number of concordant pairs is greater
than the number of discordant pairs.


$$G = \frac{20875 - 3700}{20875 + 3700} = .70$$


The value G = .70 can also be computed employing the definition of gamma presented in
Section I. Specifically:


$$G = p(C) - p(D) = \frac{20875}{24575} - \frac{3700}{24575} = .70$$


In the above equation the value 24575 is the total number of pairs (to be designated n_T),
i.e., the number of concordant plus discordant pairs, which is the denominator of Equation 32.1.
Thus, p(C) = n_C/n_T and p(D) = n_D/n_T.

Since the computed value G = .70 is close to 1, it indicates the presence of a strong 
positive/direct relationship between the two variables. Specifically, it suggests that the higher 
the rank of a subject's weight category, the higher the rank of the subject's birth order category. 
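
As a cross-check on the cell-product shortcut, the sketch below expands the table into 300 individual subjects and classifies every pair directly from the signs of the rank differences. It is a brute-force illustration, not the procedure used in the text, and the variable names are assumptions.

```python
from itertools import combinations

# Cell frequencies for Example 32.1 (rows: weight category, columns: birth order)
TABLE_32_1 = [
    [70, 15, 10, 5],
    [10, 60, 20, 10],
    [10, 15, 35, 40],
]

# One (weight_rank, birth_order_rank) tuple per subject
subjects = [
    (i + 1, j + 1)
    for i, row in enumerate(TABLE_32_1)
    for j, freq in enumerate(row)
    for _ in range(freq)
]

n_c = n_d = 0
for (x1, y1), (x2, y2) in combinations(subjects, 2):
    sign = (x1 - x2) * (y1 - y2)  # > 0: same signs, < 0: opposite signs, 0: tie
    if sign > 0:
        n_c += 1
    elif sign < 0:
        n_d += 1

n_t = n_c + n_d                    # tied pairs are excluded
gamma = (n_c - n_d) / n_t          # Equation 32.1
gamma_alt = n_c / n_t - n_d / n_t  # p(C) - p(D)
print(n_c, n_d, round(gamma, 2))   # 20875 3700 0.7
```

Both forms of the computation give the same value, since p(C) - p(D) is simply Equation 32.1 with the numerator split into two fractions.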


V. Interpretation of the Test Results 


Test 32a: Test of significance for Goodman and Kruskal's gamma  When the sample size
is relatively large (which will generally be the case when gamma is computed), the computed
value of G can be evaluated with Equation 32.2. To be more specific, Equation 32.2 (which
employs the normal distribution) is employed to evaluate the null hypothesis H_0: γ = 0. The
sign of the z value computed with Equation 32.2 will be the same as the sign of the value
computed for G.


$$z = G\sqrt{\frac{n_C + n_D}{N(1 - G^2)}} \qquad \text{(Equation 32.2)}$$


Where: N is the total number of subjects for whom scores are recorded in the ordered
contingency table


When the appropriate values from Example 32.1 are substituted in Equation 32.2, the value
z = 8.87 is computed.


$$z = .70\sqrt{\frac{20875 + 3700}{300[1 - (.70)^2]}} = 8.87$$
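
A minimal sketch of Equation 32.2 follows. The function name `gamma_z` is hypothetical, and G is entered as the rounded value .70 so that the result matches the worked example (an unrounded G gives a slightly smaller z). The two-tailed probability is obtained from the complementary error function rather than from a table.

```python
from math import erfc, sqrt


def gamma_z(g, n_c, n_d, n_subjects):
    """Equation 32.2: z = G * sqrt((n_C + n_D) / (N * (1 - G**2)))."""
    return g * sqrt((n_c + n_d) / (n_subjects * (1 - g ** 2)))


z = gamma_z(0.70, 20875, 3700, 300)
# For a standard normal variable, P(|Z| >= |z|) = erfc(|z| / sqrt(2))
p_two_tailed = erfc(abs(z) / sqrt(2))
print(round(z, 2))  # 8.87
```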


Equation 32.3 is an alternative equation for computing the value of z. The denominator of
Equation 32.3 represents the standard error of the G statistic (which will be represented by
the notation SE_G). In Section VI, SE_G is employed to compute a confidence interval for
gamma.


$$z = \frac{G}{SE_G} = \frac{G}{\sqrt{\dfrac{N(1 - G^2)}{n_C + n_D}}} \qquad \text{(Equation 32.3)}$$

$$z = \frac{.70}{\sqrt{\dfrac{300[1 - (.70)^2]}{20875 + 3700}}} = \frac{.70}{.0789} = 8.87$$


The computed value z = 8.87 is evaluated with Table A1 (Table of the Normal
Distribution) in the Appendix. In the latter table, the tabled critical two-tailed .05 and .01
values are z_.05 = 1.96 and z_.01 = 2.58, and the tabled critical one-tailed .05 and .01 values are
z_.05 = 1.65 and z_.01 = 2.33.

The following guidelines are employed in evaluating the null hypothesis.

a) If the nondirectional alternative hypothesis H_1: γ ≠ 0 is employed, the null hypothesis
can be rejected if the obtained absolute value of z is equal to or greater than the tabled critical
two-tailed value at the prespecified level of significance.

b) If the directional alternative hypothesis H_1: γ > 0 is employed, the null hypothesis can
be rejected if the sign of z is positive, and the value of z is equal to or greater than the tabled
critical one-tailed value at the prespecified level of significance.

c) If the directional alternative hypothesis H_1: γ < 0 is employed, the null hypothesis can
be rejected if the sign of z is negative, and the absolute value of z is equal to or greater than the
tabled critical one-tailed value at the prespecified level of significance.

Employing the above guidelines, the nondirectional alternative hypothesis H_1: γ ≠ 0 is
supported at both the .05 and .01 levels, since the computed value z = 8.87 is greater than the
tabled critical two-tailed values z_.05 = 1.96 and z_.01 = 2.58. The directional alternative
hypothesis H_1: γ > 0 is supported at both the .05 and .01 levels, since the computed value
z = 8.87 is a positive number that is greater than the tabled critical one-tailed values z_.05 = 1.65
and z_.01 = 2.33. The directional alternative hypothesis H_1: γ < 0 is not supported, since
the computed value z = 8.87 is a positive number.

A summary of the analysis of Example 32.1 follows: It can be concluded that there is a 
significant positive relationship between weight and birth order. 
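
Guidelines a) through c) can be expressed as a small decision function. This is a sketch under stated assumptions: the function name `reject_null` and the labels "two-sided", "greater", and "less" are introduced here, and the critical values come from the inverse normal distribution rather than from the table, so they differ slightly from the rounded tabled values (for example, 1.645 rather than 1.65).

```python
from statistics import NormalDist


def reject_null(z, alternative, alpha=0.05):
    """Return True if H0: gamma = 0 is rejected in favor of the stated alternative."""
    if alternative == "two-sided":   # H1: gamma != 0
        return abs(z) >= NormalDist().inv_cdf(1 - alpha / 2)
    critical = NormalDist().inv_cdf(1 - alpha)
    if alternative == "greater":     # H1: gamma > 0
        return z >= critical
    if alternative == "less":        # H1: gamma < 0
        return z <= -critical
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")


z = 8.87
for alternative in ("two-sided", "greater", "less"):
    print(alternative, reject_null(z, alternative, 0.05), reject_null(z, alternative, 0.01))
```

For z = 8.87 the first two alternatives are supported at both levels and the third is not, in agreement with the summary of Example 32.1.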


VI. Additional Analytical Procedures for Goodman and Kruskal’s 
Gamma and/or Related Tests 


1. The computation of a confidence interval for the value of Goodman and Kruskal’s 
gamma Equation 32.4 is employed to compute a confidence interval for a computed value of 
gamma. 


$$CI_{(1-\alpha)} = G \pm (z_{\alpha/2})(SE_G) \qquad \text{(Equation 32.4)}$$


Where: z_(α/2) represents the tabled critical two-tailed value in the normal distribution below
which a proportion (percentage) equal to [1 - (α/2)] of the cases falls. If the
proportion (percentage) of the distribution that falls within the confidence interval is
subtracted from 1 (100%), it will equal the value of α.


Equation 32.4 will be employed to compute the 95% confidence interval for gamma. Along
with the tabled critical two-tailed .05 value z_.05 = 1.96, the following values computed for
Example 32.1 are substituted in Equation 32.4: G = .70 and SE_G = .0789 (which is the
computed value of the denominator of Equation 32.3, which as noted in Section V represents the
standard error of G).


$$CI_{.95} = .70 \pm (1.96)(.0789) = .70 \pm .15$$


Subtracting .15 from and adding .15 to .70 yields the values .55 and .85. Thus, the researcher
can be 95% confident (or the probability is .95) that the true value of gamma in the underlying
population falls between .55 and .85. Symbolically, this can be written as follows: .55 < γ < .85.
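
The interval can be reproduced with a short sketch. The function name `gamma_confidence_interval` is hypothetical; the standard error is computed as the denominator of Equation 32.3, and G is entered as the rounded value .70 used in the example.

```python
from math import sqrt
from statistics import NormalDist


def gamma_confidence_interval(g, n_c, n_d, n_subjects, confidence=0.95):
    """Equation 32.4: G +/- z_(alpha/2) * SE_G, with SE_G = sqrt(N(1 - G**2) / (n_C + n_D))."""
    se_g = sqrt(n_subjects * (1 - g ** 2) / (n_c + n_d))
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * se_g
    return g - margin, g + margin, se_g


lower, upper, se_g = gamma_confidence_interval(0.70, 20875, 3700, 300)
print(round(se_g, 4), round(lower, 2), round(upper, 2))  # 0.0789 0.55 0.85
```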


2. Test 32b: Test for evaluating the null hypothesis H_0: γ_1 = γ_2  Marascuilo and
McSweeney (1977) note that Equation 32.5 can be employed to determine whether or not there 
is a significant difference between two independent values of gamma. Use of Equation 32.5 
assumes that the following conditions have been met: a) The sample size in each of two ordered 
contingency tables is large enough for evaluation with the normal approximation; b) The values 
of r and c are identical in the two ordered contingency tables; and c) The same row and column 
categories are employed in the two ordered contingency tables. 


$$z = \frac{G_1 - G_2}{\sqrt{SE_{G_1}^2 + SE_{G_2}^2}} \qquad \text{(Equation 32.5)}$$


Where: G_1 and G_2 are the computed values of gamma for the two ordered contingency
tables, and SE_G1 and SE_G2 are the computed values of the standard error for the
two values of gamma


To illustrate the use of Equation 32.5, assume that the study described in Example 32.1
is replicated with a different sample comprised of N = 600 subjects. The obtained value of
gamma for the sample is G = .50, with SE_G = .0438. By employing the values G_1 = .70,
SE_G1 = .0789, G_2 = .50, and SE_G2 = .0438 in Equation 32.5, the researcher can evaluate the
null hypothesis H_0: γ_1 = γ_2. Because the standard error of the difference between two
independent statistics is the square root of the sum of their squared standard errors, each
standard error is squared before the two values are summed. Substituting the appropriate
values in Equation 32.5 yields the value z = 2.22.


$$z = \frac{.70 - .50}{\sqrt{(.0789)^2 + (.0438)^2}} = \frac{.20}{.0902} = 2.22$$


The same guidelines described for evaluating the alternative hypotheses H_1: γ ≠ 0,
H_1: γ > 0, and H_1: γ < 0 are, respectively, employed for evaluating the alternative
hypotheses H_1: γ_1 ≠ γ_2, H_1: γ_1 > γ_2, and H_1: γ_1 < γ_2. The nondirectional alternative
hypothesis H_1: γ_1 ≠ γ_2 is supported at the .05 level, since the computed value z = 2.22 is
greater than the tabled critical two-tailed value z_.05 = 1.96. It is not supported at the .01 level,
since z = 2.22 is less than the tabled critical two-tailed value z_.01 = 2.58. The directional
alternative hypothesis H_1: γ_1 > γ_2 is supported at the .05 level, since the computed value
z = 2.22 is a positive number that is greater than the tabled critical one-tailed value
z_.05 = 1.65. It is not supported at the .01 level, since z = 2.22 is less than the tabled critical
one-tailed value z_.01 = 2.33. The directional alternative hypothesis H_1: γ_1 < γ_2 is not
supported, since the computed value z = 2.22 is a positive number. Thus, the difference
G_1 - G_2 = .70 - .50 = .20 is significant at the .05 level but not at the .01 level.
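
A minimal sketch of the two-sample comparison is given below, assuming the two values of gamma come from independent samples, so that the variance of their difference is the sum of the two squared standard errors. The function name `gamma_difference_z` is hypothetical. Note that the standard errors are squared inside the square root; adding them unsquared would give about .57 instead.

```python
from math import sqrt


def gamma_difference_z(g1, se1, g2, se2):
    """z for H0: gamma_1 = gamma_2, using the variance-sum rule for independent samples."""
    return (g1 - g2) / sqrt(se1 ** 2 + se2 ** 2)


z = gamma_difference_z(0.70, 0.0789, 0.50, 0.0438)
print(round(z, 2))  # 2.22
```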


3. Sources for computing a partial correlation coefficient for Goodman and Kruskal’s 
gamma A procedure developed by Davis (1967) for computing a partial correlation for gamma 
is described in Marascuilo and McSweeney (1977). 


VII. Additional Discussion of Goodman and Kruskal’s Gamma


1. Relationship between Goodman and Kruskal’s gamma and Yule’s Q  In Section I it is
noted that Yule's Q is a special case of Goodman and Kruskal’s gamma. To illustrate this,
assume that the four cells in Tables 16.2/16.3 (for which Yule's Q is computed) represent a
2 x 2 contingency table in which the cells on both the row and the column variables are ordered.
If the procedure described for determining concordant pairs is employed with the data in Tables
16.2/16.3, the only cell that will generate a product other than zero is Cell_11 (which corresponds
to Cell a within the framework of the notation used for a 2 x 2 contingency table). Specifically,
the product for Cell_11, which will correspond to the value of n_C, is (n_11)(n_22) = (30)(40) = 1200.
In the same respect, if the procedure described for determining discordant pairs is employed, the
only cell that will generate a product other than zero is Cell_12 (which corresponds to Cell b within
the framework of the notation used for a 2 x 2 contingency table). Specifically, the product for
Cell_12, which will correspond to the value of n_D, is (n_12)(n_21) = (70)(60) = 4200. When the
values n_C = 1200 and n_D = 4200 are substituted in Equation 32.1, G = (1200 - 4200)/(1200
+ 4200) = -.56. Note that this result is identical to that obtained when Equation 16.20 is
employed to compute Yule's Q for the same set of data: Q = (ad - bc)/(ad + bc) = [(30)(40)
- (70)(60)]/[(30)(40) + (70)(60)] = -.56. It should be noted that unlike gamma, which is only
employed with ordered contingency tables, Yule's Q can be employed with both ordered and
unordered 2 x 2 contingency tables.
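
The equivalence can be checked numerically. In the sketch below the cell values a = 30, b = 70, c = 60, and d = 40 are taken from the substitution shown in the text; the function names are illustrative assumptions.

```python
def yules_q(a, b, c, d):
    """Yule's Q for a 2 x 2 table with cells a, b in the top row and c, d in the bottom row."""
    return (a * d - b * c) / (a * d + b * c)


def gamma_2x2(table):
    """Gamma for a 2 x 2 ordered table, where only Cell_11 and Cell_12 yield nonzero products."""
    n_c = table[0][0] * table[1][1]  # Cell_11 times the one cell below and to the right
    n_d = table[0][1] * table[1][0]  # Cell_12 times the one cell below and to the left
    return (n_c - n_d) / (n_c + n_d)


table = [[30, 70], [60, 40]]
print(round(yules_q(30, 70, 60, 40), 2), round(gamma_2x2(table), 2))  # -0.56 -0.56
```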


2. Somers' delta as an alternative measure of association for an ordered contingency table 
Somers (1962) has developed an alternative measure of association for ordered contingency 
tables referred to as delta (which is represented by the upper case Greek letter Δ). Siegel and
Castellan (1988) identify delta as an asymmetrical measure of association (as opposed to a 
symmetrical measure of association). An asymmetrical measure of association is employed 
when one variable is distinguished in a meaningful way from the other variable (e.g., within the 
context of the study, one variable is more important than the other, or one variable represents an 
independent variable and the other a dependent variable). Within this framework, gamma is 
viewed as a symmetrical measure of association, since it does not assume a meaningful 
distinction between the variables within the context noted above. A full discussion of Somers’ 
delta can be found in Siegel and Castellan (1988). 


VIII. Additional Examples Illustrating the Use of Goodman and 
Kruskal's Gamma 


Examples 32.2 and 32.3 are two additional examples that can be evaluated with Goodman and 
Kruskal's gamma. Since Examples 32.2 and 32.3 employ the same data as Example 32.1, they 
yield the same result. Example 32.4 describes the identical study described by Example 32.1, 
but uses a different configuration of data in order to illustrate the computation of a negative value 
for gamma. 


Example 32.2. A consumer group conducts a survey in order to determine whether a rela- 
tionship exists between customer satisfaction and the price a person pays for an automobile. 
Each of 300 individuals who has purchased a new vehicle within the past year is classified in one 
of four categories based on the purchase price of one's automobile. Each subject is also clas- 
sified in one of three categories with respect to how satisfied he or she is with one's automobile. 
The results are summarized in Table 32.4. Do the data indicate there is a relationship between 
the price of an automobile and degree of satisfaction? 


Example 32.3 A panel of psychiatrists wants to determine whether a relationship exists between
the number of years a patient is in psychotherapy and the degree of change in a patient's
behavior. Each of 300 patients is categorized with respect to one of four time periods during
which he or she is in psychotherapy, and one of three categories with respect to the change in
behavior he or she has exhibited since initiating therapy. Specifically, the following four
categories are employed with respect to psychotherapy duration: less than one year; one to two
years; more than two years to three years; more than three years. The following three
categories are employed with respect to changes in behavior: deteriorated (-), no change,
improved (+). Table 32.5 summarizes the data. Do the data indicate there is an association
between the amount of time a patient is in psychotherapy and the degree to which he or she
changes?


Table 32.4 Summary of Data for Example 32.2


                                     Purchase price
Level of           Under      $10,000 to   $18,001 to   More than
satisfaction       $10,000    $18,000      $30,000      $30,000     Row sums

Below average         70         15           10            5         100
Average               10         60           20           10         100
Above average         10         15           35           40         100

Column sums           90         90           65           55         300


Table 32.5 Summary of Data for Example 32.3


                            Number of years in psychotherapy
Change in          Less than   One to two   More than two years   More than
behavior           one year    years        to three years        three years   Row sums

Deteriorated (-)      70          15               10                  5          100
No change             10          60               20                 10          100
Improved (+)          10          15               35                 40          100

Column sums           90          90               65                 55          300


Example 32.4 A researcher wants to determine whether or not a relationship exists between 
a person's weight and birth order. Upon determining the weight and birth order of 300 
subjects, each subject is categorized with respect to one of three weight categories and one of 
four birth order categories. Specifically, the following three categories are employed with 
respect to weight: below average, average, above average. The following four categories are 
employed with respect to birth order: first born, second born, third born, fourth born and 
all subsequent birth orders. Table 32.6 summarizes the data. Do the data indicate there is a 
significant association between a person's weight and birth order? 


Table 32.6 Summary of Data for Example 32.4


                              Birth order
Weight           1st born   2nd born   3rd born   4th born+   Row sums

Below average        5         10         15         70         100
Average             10         20         60         10         100
Above average       40         35         15         10         100

Column sums         55         65         90         90         300

Inspection of the data reveals that the cell frequencies in Table 32.6 are the mirror image
of those employed in Table 32.1. By virtue of employing the same frequencies in an inverted
format, the values of n_C and n_D for Table 32.6 are the reverse of those obtained for Table 32.1.
Thus, for Table 32.6, n_C = 3700 and n_D = 20875. Consequently, employing Equation 32.1,
G = (3700 - 20875)/(3700 + 20875) = -.70. Because the same configuration of data is employed
in an inverted format, the value G = -.70 computed for Table 32.6 has the same absolute value
as the value computed for Table 32.1. Note that the negative correlation G = -.70 indicates that
a subject's birth order is inversely related to his or her weight. Specifically, subjects in a low
birth order category are more likely to be above average in weight, while subjects in a high birth
order category are more likely to be below average in weight.
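
The reversal of n_C and n_D can be demonstrated with a short sketch. It assumes that the mirror image is obtained by reversing the column order of the Example 32.1 frequencies, which is consistent with the column sums 55, 65, 90, 90 reported for Table 32.6; the function name `gamma_counts` is hypothetical.

```python
def gamma_counts(table):
    """Return (n_C, n_D, G) for an ordered contingency table given as a list of rows."""
    n_rows, n_cols = len(table), len(table[0])
    n_c = n_d = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                n_c += table[i][j] * sum(table[k][j + 1:])  # cells below and to the right
                n_d += table[i][j] * sum(table[k][:j])      # cells below and to the left
    return n_c, n_d, (n_c - n_d) / (n_c + n_d)


table_32_1 = [[70, 15, 10, 5], [10, 60, 20, 10], [10, 15, 35, 40]]
# Assumed layout of the mirror-image table: same rows, column order reversed
table_32_6 = [row[::-1] for row in table_32_1]

n_c, n_d, g = gamma_counts(table_32_6)
print(n_c, n_d, round(g, 2))  # 3700 20875 -0.7
```

Reversing the columns turns every "below and to the right" relationship into a "below and to the left" relationship, which is why the two counts simply trade places.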


Endnotes


1. The general model for an r x c contingency table (which is summarized in Table 16.1) is 
discussed in Section I of the chi-square test for r x c tables (Test 16). 


2. Gamma can also be computed if the ordering is reversed, i.e., within both variables, the
first row/column represents the category with the highest magnitude, and the last row/column
represents the category with the lowest magnitude.


3. Some sources employ the following statements as the null hypothesis and the nondirectional
alternative hypothesis for Goodman and Kruskal’s gamma: Null hypothesis: H_0:
Variables X and Y are independent of one another; Nondirectional alternative hypothesis:
H_1: Variables X and Y are not independent of one another.

It is, in fact, true that if in the underlying population the two variables are independent,
the value of γ will equal zero. However, Siegel and Castellan (1988) note that if γ = 0, the
latter does not in and of itself ensure that the two variables are independent of one another
(unless the contingency table is a 2 x 2 table).


4. Equation 32.2 can also be written in the following form:


$$z = (G - \gamma)\sqrt{\frac{n_C + n_D}{N(1 - G^2)}}$$


In the above equation, γ represents the value of gamma stated in the null hypothesis.
When the latter value equals zero, the above equation reduces to Equation 32.2. When some
value other than zero is stipulated for gamma in the null hypothesis, the equation noted
above can be employed to evaluate the null hypothesis H_0: γ = γ_0 (where γ_0 represents the
value stipulated for the population correlation).


5. Sources that discuss the evaluation of the null hypothesis H_0: γ = 0 note that the normal
approximation computed with Equations 32.2/32.3 tends to be overly conservative.
Consequently, the likelihood of committing a Type I error (i.e., rejecting H_0 when it is true)
is actually less than the value of alpha employed in the analysis.


6. It could be argued that it might be more appropriate to employ Somers’ delta (which is 
briefly discussed in Section VII) rather than gamma as a measure of association for Example 
32.3. The use of delta could be justified, if within the framework of a study the number of 
years of therapy represents an independent variable and the amount of change represents the 
dependent variable. In point of fact, depending upon how one conceptualizes the relationship 
between the two variables, one could also argue for the use of delta as a measure of 
association for Example 32.1. In the final analysis, it will not always be clear whether it is 
more appropriate to employ gamma or delta as a measure of association for an ordered 
contingency table. 


Appendix: Tables


Acknowledgments and Sources for Tables in Appendix 


Table A1 Table of the Normal Distribution 
Reprinted with permission of CRC Press, Boca Raton, Florida from W. H. Beyer (1968), 
CRC Handbook of Tables for Probability and Statistics (2nd ed.), Table II.1 (The 
normal probability function and related functions), pp. 127-134. 


Table A2 Table of Student's t Distribution 
Reprinted with permission from Table 12 (Percentage points for the t distribution) in E. S. 
Pearson and H. O. Hartley, eds. (1970), Biometrika Tables for Statisticians (3rd ed., 
Volume I). New York: Cambridge University Press. Reproduced with kind permission 
of Biometrika trustees. 


Table A3 Power Curves for Student's t Distribution 
Reprinted with permission of Addison-Wesley Longman Publishing Company, Inc. from 
Table 2.2 (Graphs of the operating characteristics of Student's t test) in D. B. Owen
(©1962), Handbook of Statistical Tables. Reading, MA: Addison-Wesley, pp. 32-35. 


Table A4 Table of the Chi-Square Distribution 
Reprinted with permission from Table 8 (Percentage points of the χ² distribution) in E. S.
Pearson and H. O. Hartley, eds. (1970), Biometrika Tables for Statisticians (3rd ed., 
Volume I). New York: Cambridge University Press. Reproduced with kind permission 
of Biometrika trustees. 


Table A5 Table of Critical T Values for Wilcoxon's Signed Ranks and Matched-Pairs 
Signed-Ranks Tests 
Material from Table II in F. Wilcoxon, S. K. Katti and R. A. Wilcox (1963), Critical 
Values and Probability Levels for the Wilcoxon Rank Sum Test and the Wilcoxon 
Signed Rank Test. Copyright © 1963, American Cyanamid Company, Lederle Laboratories
Division. All rights reserved and reprinted with permission.


Table A6 Table of the Binomial Distribution, Individual Probabilities 
Reprinted with permission of CRC Press, Boca Raton, Florida from W. H. Beyer (1968), 
CRC Handbook of Tables for Probability and Statistics (2nd ed.), Table II.1 
(Individual terms, binomial distribution), pp. 182-193. 


Table A7 Table of the Binomial Distribution, Cumulative Probabilities 
Reprinted with permission of CRC Press, Boca Raton, Florida from W. H. Beyer (1968), 
CRC Handbook of Tables for Probability and Statistics (2nd ed.), Table III.2
(Cumulative terms, binomial distribution), pp. 194-205.


Table A8 Table of Critical Values for the Single-Sample Runs Test 


Reprinted with permission of Institute of Mathematical Statistics, Hayward, CA from the 
following: Portions of Table II on pp. 83-87 from: F. S. Swed and C. Eisenhart (1943). 
Tables for testing randomness of grouping in a sequence of alternatives, Annals of 
Mathematical Statistics, 14, 66-87. 


Table A9 Table of the Fmax Distribution 
Reprinted with permission from Table 31 (Percentage points of the ratio s²max/s²min) in E. S. 
Pearson and H. O. Hartley, eds. (1970). Biometrika Tables for Statisticians (3rd ed., 
Volume 1). New York: Cambridge University Press. Reproduced with the kind 
permission of the Biometrika trustees. 


Table A10 Table of the F Distribution 

Reprinted with permission from Table 18 (Percentage points of the F-distribution (variance 
ratio)) in E. S. Pearson and H. O. Hartley (eds.) (1970), Biometrika Tables for Statis- 
ticians (3rd ed., Volume 1). New York: Cambridge University Press. Reproduced with 
the kind permission of the Biometrika trustees. Table reproduced with permission of CRC 
Press, Boca Raton, Florida from W. H. Beyer (1968), CRC Handbook of Tables for 
Probability and Statistics (2nd ed.), Table VI.1 (Percentage Points, F Distribution), pp. 
304-310. 


Table A11 Table of Critical Values for Mann-Whitney U Statistic 
Reprinted with permission of Indiana University from D. Auble (1953), Extended Tables 
for the Mann-Whitney Statistic. Bulletin of the Institute of Educational Research at 
Indiana University Vol. 1, No. 2. Table reproduced with permission of CRC Press, Boca 
Raton, Florida from W. H. Beyer (1968), CRC Handbook of Tables for Probability and 
Statistics (2nd ed.), Table X.4 (Critical Values of U in the Wilcoxon (Mann-Whitney) 
Two-Sample Statistic), pp. 405-408. 


Table A12 Table of Sandler's A Statistic 
Reprinted with permission of British Psychological Society and Joseph Sandler from J. 
Sandler (1955), A test of the significance of difference between the means of correlated 
measures based on a simplification of Student's t. British Journal of Psychology, 46, pp. 
225-226. 


Table A13 Table of the Studentized Range Statistic 
Reprinted with permission from Table 29 (Percentage points of the studentized range) in 
E. S. Pearson and H. O. Hartley, eds. (1970), Biometrika Tables for Statisticians (3rd ed., 
Volume 1). New York: Cambridge University Press. Reproduced with the kind 
permission of the Biometrika trustees. 


Table A14 Table of Dunnett's Modified t Statistic for a Control Group Comparison 

Two-tailed values: Reprinted with permission of the Biometric Society, Alexandria, VA 
from: C. W. Dunnett (1964), New tables for multiple comparisons with a control. Bio- 
metrics, 20, pp. 482-491. 
One-tailed values: Reprinted with permission of the American Statistical Association, 
Alexandria, VA from: C. W. Dunnett (1955). A multiple comparison procedure for com- 
paring several treatments with a control. Journal of the American Statistical 
Association, 50, pp. 1096-1121. 


Table A15 Graphs of the Power Function for the Analysis of Variance 
Reprinted with permission of Biometrika from E. S. Pearson and H. O. Hartley (1951), 
Charts of the power function for analysis of variance tests, derived from the non-central 
F distribution, Biometrika, 38, pp. 112-130. 


Table A16 Table of Critical Values for Pearson r 
Reprinted with permission from Table 13 (Percentage points for the distribution of the 
correlation coefficient, r, when ρ = 0) in E. S. Pearson and H. O. Hartley, eds. (1970), 
Biometrika Tables for Statisticians (3rd ed., Volume 1). New York: Cambridge Uni- 
versity Press. Reproduced with the kind permission of the Biometrika trustees. 


Table A17 Table of Fisher's zr Transformation 
Reprinted with permission from Table 14 (The z-transformation of the correlation 
coefficient, z = tanh⁻¹ r) in E. S. Pearson and H. O. Hartley, eds. (1970), Biometrika Tables 
for Statisticians (3rd ed., Volume 1). New York: Cambridge University Press. 
Reproduced with the kind permission of the Biometrika trustees. 


Table A18 Table of Critical Values for Spearman's Rho 
Reprinted with permission of the American Statistical Association, Alexandria, VA from: 
J. H. Zar (1972), Significance testing of the Spearman rank correlation coefficient. Journal 
of the American Statistical Association, 67, pp. 578-580 (Table 1, p. 579). 


Table A19 Table of Critical Values for Kendall’s Tau 
Reprinted with permission of Blackwell Publishers and Statistica Neerlandica, from Table 
III in L. Kaarsemaker and A. van Wijngaarden (1953), Tables for use in rank correlation. 
Statistica Neerlandica, 7, pp. 41-54 (Copyright: The Netherlands Statistical Society 
(VVS)). 


Table A20 Table of Critical Values for Kendall’s Coefficient of Concordance 
Reprinted with permission of Institute of Mathematical Statistics, Hayward, CA from: M. 
Friedman (1940), A comparison of alternative tests of significance for the problem of m 
rankings. Annals of Mathematical Statistics, 11, 86-92 (Table III, p. 91). 


Table A21 Table of Critical Values for the Kolmogorov-Smirnov Goodness-of-Fit Test for 
a Single Sample 
Reprinted with permission of Institute of Mathematical Statistics, Hayward, CA from: L. 
H. Miller (1956). Table of percentage points of Kolmogorov statistics. Journal of the 
American Statistical Association, 51, pp. 111-121. 


Table A22 Table of Critical Values for the Lilliefors Test for Normality 
Reprinted with permission of Institute of Mathematical Statistics, Hayward, CA from: 
H. W. Lilliefors (1967). On the Kolmogorov-Smirnov test for normality with mean and 
variance unknown. Journal of the American Statistical Association, 62, pp. 399-402. 


Table A23 Table of Critical Values for the Kolmogorov-Smirnov Test for Two Indepen- 
dent Samples 
Reprinted with permission of Institute of Mathematical Statistics, Hayward, CA from: F. 
J. Massey Jr. (1952). Distribution tables for the deviation between two sample cumulatives. 
Annals of Mathematical Statistics, 23, pp. 435-441. 
Table A1 Table of the Normal Distribution 

z      p(μ to z)   p(z to tail)   ordinate
.00    .0000       .5000          .3989
.01    .0040       .4960          .3989
.02    .0080       .4920          .3989
.03    .0120       .4880          .3988
.04    .0160       .4840          .3986
.05    .0199       .4801          .3984
.06    .0239       .4761          .3982
.07    .0279       .4721          .3980
.08    .0319       .4681          .3977
.09    .0359       .4641          .3973
.10    .0398       .4602          .3970
.11    .0438       .4562          .3965
.12    .0478       .4522          .3961
.13    .0517       .4483          .3956
.14    .0557       .4443          .3951
.15    .0596       .4404          .3945
.16    .0636       .4364          .3939
.17    .0675       .4325          .3932
.18    .0714       .4286          .3925
.19    .0753       .4247          .3918
.20    .0793       .4207          .3910
.21    .0832       .4168          .3902
.22    .0871       .4129          .3894
.23    .0910       .4090          .3885
.24    .0948       .4052          .3876
.25    .0987       .4013          .3867
.26    .1026       .3974          .3857
.27    .1064       .3936          .3847
.28    .1103       .3897          .3836
.29    .1141       .3859          .3825
.30    .1179       .3821          .3814
.31    .1217       .3783          .3802
.32    .1255       .3745          .3790
.33    .1293       .3707          .3778
.34    .1331       .3669          .3765
.35    .1368       .3632          .3752
.36    .1406       .3594          .3739
.37    .1443       .3557          .3725
.38    .1480       .3520          .3712
.39    .1517       .3483          .3697
.40    .1554       .3446          .3683
.41    .1591       .3409          .3668
.42    .1628       .3372          .3653
.43    .1664       .3336          .3637
.44    .1700       .3300          .3621


z      p(μ to z)   p(z to tail)   ordinate
.45    .1736       .3264          .3605
.46    .1772       .3228          .3589
.47    .1808       .3192          .3572
.48    .1844       .3156          .3555
.49    .1879       .3121          .3538
.50    .1915       .3085          .3521
.51    .1950       .3050          .3503
.52    .1985       .3015          .3485
.53    .2019       .2981          .3467
.54    .2054       .2946          .3448
.55    .2088       .2912          .3429
.56    .2123       .2877          .3410
.57    .2157       .2843          .3391
.58    .2190       .2810          .3372
.59    .2224       .2776          .3352
.60    .2257       .2743          .3332
.61    .2291       .2709          .3312
.62    .2324       .2676          .3292
.63    .2357       .2643          .3271
.64    .2389       .2611          .3251
.65    .2422       .2578          .3230
.66    .2454       .2546          .3209
.67    .2486       .2514          .3187
.68    .2517       .2483          .3166
.69    .2549       .2451          .3144
.70    .2580       .2420          .3123
.71    .2611       .2389          .3101
.72    .2642       .2358          .3079
.73    .2673       .2327          .3056
.74    .2704       .2296          .3034
.75    .2734       .2266          .3011
.76    .2764       .2236          .2989
.77    .2794       .2206          .2966
.78    .2823       .2177          .2943
.79    .2852       .2148          .2920
.80    .2881       .2119          .2897
.81    .2910       .2090          .2874
.82    .2939       .2061          .2850
.83    .2967       .2033          .2827
.84    .2995       .2005          .2803
.85    .3023       .1977          .2780
.86    .3051       .1949          .2756
.87    .3078       .1922          .2732
.88    .3106       .1894          .2709
.89    .3133       .1867          .2685


Table A1 Table of the Normal Distribution (continued) 

z      p(μ to z)   p(z to tail)   ordinate
.90    .3159       .1841          .2661
.91    .3186       .1814          .2637
.92    .3212       .1788          .2613
.93    .3238       .1762          .2589
.94    .3264       .1736          .2565
.95    .3289       .1711          .2541
.96    .3315       .1685          .2516
.97    .3340       .1660          .2492
.98    .3365       .1635          .2468
.99    .3389       .1611          .2444
1.00   .3413       .1587          .2420
1.01   .3438       .1562          .2396
1.02   .3461       .1539          .2371
1.03   .3485       .1515          .2347
1.04   .3508       .1492          .2323
1.05   .3531       .1469          .2299
1.06   .3554       .1446          .2275
1.07   .3577       .1423          .2251
1.08   .3599       .1401          .2227
1.09   .3621       .1379          .2203
1.10   .3643       .1357          .2179
1.11   .3665       .1335          .2155
1.12   .3686       .1314          .2131
1.13   .3708       .1292          .2107
1.14   .3729       .1271          .2083
1.15   .3749       .1251          .2059
1.16   .3770       .1230          .2036
1.17   .3790       .1210          .2012
1.18   .3810       .1190          .1989
1.19   .3830       .1170          .1965
1.20   .3849       .1151          .1942
1.21   .3869       .1131          .1919
1.22   .3888       .1112          .1895
1.23   .3907       .1093          .1872
1.24   .3925       .1075          .1849
1.25   .3944       .1056          .1826
1.26   .3962       .1038          .1804
1.27   .3980       .1020          .1781
1.28   .3997       .1003          .1758
1.29   .4015       .0985          .1736
1.30   .4032       .0968          .1714
1.31   .4049       .0951          .1691
1.32   .4066       .0934          .1669
1.33   .4082       .0918          .1647
1.34   .4099       .0901          .1626


z      p(μ to z)   p(z to tail)   ordinate
1.35   .4115       .0885          .1604
1.36   .4131       .0869          .1582
1.37   .4147       .0853          .1561
1.38   .4162       .0838          .1539
1.39   .4177       .0823          .1518
1.40   .4192       .0808          .1497
1.41   .4207       .0793          .1476
1.42   .4222       .0778          .1456
1.43   .4236       .0764          .1435
1.44   .4251       .0749          .1415
1.45   .4265       .0735          .1394
1.46   .4279       .0721          .1374
1.47   .4292       .0708          .1354
1.48   .4306       .0694          .1334
1.49   .4319       .0681          .1315
1.50   .4332       .0668          .1295
1.51   .4345       .0655          .1276
1.52   .4357       .0643          .1257
1.53   .4370       .0630          .1238
1.54   .4382       .0618          .1219
1.55   .4394       .0606          .1200
1.56   .4406       .0594          .1182
1.57   .4418       .0582          .1163
1.58   .4429       .0571          .1145
1.59   .4441       .0559          .1127
1.60   .4452       .0548          .1109
1.61   .4463       .0537          .1092
1.62   .4474       .0526          .1074
1.63   .4484       .0516          .1057
1.64   .4495       .0505          .1040
1.65   .4505       .0495          .1023
1.66   .4515       .0485          .1006
1.67   .4525       .0475          .0989
1.68   .4535       .0465          .0973
1.69   .4545       .0455          .0957
1.70   .4554       .0446          .0940
1.71   .4564       .0436          .0925
1.72   .4573       .0427          .0909
1.73   .4582       .0418          .0893
1.74   .4591       .0409          .0878
1.75   .4599       .0401          .0863
1.76   .4608       .0392          .0848
1.77   .4616       .0384          .0833
1.78   .4625       .0375          .0818
1.79   .4633       .0367          .0804


Table A1 Table of the Normal Distribution (continued) 

z      p(μ to z)   p(z to tail)   ordinate
1.80   .4641       .0359          .0790
1.81   .4649       .0351          .0775
1.82   .4656       .0344          .0761
1.83   .4664       .0336          .0748
1.84   .4671       .0329          .0734
1.85   .4678       .0322          .0721
1.86   .4686       .0314          .0707
1.87   .4693       .0307          .0694
1.88   .4699       .0301          .0681
1.89   .4706       .0294          .0669
1.90   .4713       .0287          .0656
1.91   .4719       .0281          .0644
1.92   .4726       .0274          .0632
1.93   .4732       .0268          .0620
1.94   .4738       .0262          .0608
1.95   .4744       .0256          .0596
1.96   .4750       .0250          .0584
1.97   .4756       .0244          .0573
1.98   .4761       .0239          .0562
1.99   .4767       .0233          .0551
2.00   .4772       .0228          .0540
2.01   .4778       .0222          .0529
2.02   .4783       .0217          .0519
2.03   .4788       .0212          .0508
2.04   .4793       .0207          .0498
2.05   .4798       .0202          .0488
2.06   .4803       .0197          .0478
2.07   .4808       .0192          .0468
2.08   .4812       .0188          .0459
2.09   .4817       .0183          .0449
2.10   .4821       .0179          .0440
2.11   .4826       .0174          .0431
2.12   .4830       .0170          .0422
2.13   .4834       .0166          .0413
2.14   .4838       .0162          .0404
2.15   .4842       .0158          .0396
2.16   .4846       .0154          .0387
2.17   .4850       .0150          .0379
2.18   .4854       .0146          .0371
2.19   .4857       .0143          .0363
2.20   .4861       .0139          .0355
2.21   .4864       .0136          .0347
2.22   .4868       .0132          .0339
2.23   .4871       .0129          .0332
2.24   .4875       .0125          .0325


z      p(μ to z)   p(z to tail)   ordinate
2.25   .4878       .0122          .0317
2.26   .4881       .0119          .0310
2.27   .4884       .0116          .0303
2.28   .4887       .0113          .0297
2.29   .4890       .0110          .0290
2.30   .4893       .0107          .0283
2.31   .4896       .0104          .0277
2.32   .4898       .0102          .0270
2.33   .4901       .0099          .0264
2.34   .4904       .0096          .0258
2.35   .4906       .0094          .0252
2.36   .4909       .0091          .0246
2.37   .4911       .0089          .0241
2.38   .4913       .0087          .0235
2.39   .4916       .0084          .0229
2.40   .4918       .0082          .0224
2.41   .4920       .0080          .0219
2.42   .4922       .0078          .0213
2.43   .4925       .0075          .0208
2.44   .4927       .0073          .0203
2.45   .4929       .0071          .0198
2.46   .4931       .0069          .0194
2.47   .4932       .0068          .0189
2.48   .4934       .0066          .0184
2.49   .4936       .0064          .0180
2.50   .4938       .0062          .0175
2.51   .4940       .0060          .0171
2.52   .4941       .0059          .0167
2.53   .4943       .0057          .0163
2.54   .4945       .0055          .0158
2.55   .4946       .0054          .0155
2.56   .4948       .0052          .0151
2.57   .4949       .0051          .0147
2.58   .4951       .0049          .0143
2.59   .4952       .0048          .0139
2.60   .4953       .0047          .0136
2.61   .4955       .0045          .0132
2.62   .4956       .0044          .0129
2.63   .4957       .0043          .0126
2.64   .4959       .0041          .0122
2.65   .4960       .0040          .0119
2.66   .4961       .0039          .0116
2.67   .4962       .0038          .0113
2.68   .4963       .0037          .0110
2.69   .4964       .0036          .0107


Table A1 Table of the Normal Distribution (continued) 

z      p(μ to z)   p(z to tail)   ordinate
2.70   .4965       .0035          .0104
2.71   .4966       .0034          .0101
2.72   .4967       .0033          .0099
2.73   .4968       .0032          .0096
2.74   .4969       .0031          .0093
2.75   .4970       .0030          .0091
2.76   .4971       .0029          .0088
2.77   .4972       .0028          .0086
2.78   .4973       .0027          .0084
2.79   .4974       .0026          .0081
2.80   .4974       .0026          .0079
2.81   .4975       .0025          .0077
2.82   .4976       .0024          .0075
2.83   .4977       .0023          .0073
2.84   .4977       .0023          .0071
2.85   .4978       .0022          .0069
2.86   .4979       .0021          .0067
2.87   .4979       .0021          .0065
2.88   .4980       .0020          .0063
2.89   .4981       .0019          .0061
2.90   .4981       .0019          .0060
2.91   .4982       .0018          .0058
2.92   .4982       .0018          .0056
2.93   .4983       .0017          .0055
2.94   .4984       .0016          .0053
2.95   .4984       .0016          .0051
2.96   .4985       .0015          .0050
2.97   .4985       .0015          .0048
2.98   .4986       .0014          .0047
2.99   .4986       .0014          .0046
3.00   .4987       .0013          .0044
3.01   .4987       .0013          .0043
3.02   .4987       .0013          .0042
3.03   .4988       .0012          .0040
3.04   .4988       .0012          .0039
3.05   .4989       .0011          .0038
3.06   .4989       .0011          .0037
3.07   .4989       .0011          .0036
3.08   .4990       .0010          .0035
3.09   .4990       .0010          .0034
3.10   .4990       .0010          .0033
3.11   .4991       .0009          .0032
3.12   .4991       .0009          .0031
3.13   .4991       .0009          .0030
3.14   .4992       .0008          .0029


z      p(μ to z)   p(z to tail)   ordinate
3.15   .4992       .0008          .0028
3.16   .4992       .0008          .0027
3.17   .4992       .0008          .0026
3.18   .4993       .0007          .0025
3.19   .4993       .0007          .0025
3.20   .4993       .0007          .0024
3.21   .4993       .0007          .0023
3.22   .4994       .0006          .0022
3.23   .4994       .0006          .0022
3.24   .4994       .0006          .0021
3.25   .4994       .0006          .0020
3.26   .4994       .0006          .0020
3.27   .4995       .0005          .0019
3.28   .4995       .0005          .0018
3.29   .4995       .0005          .0018
3.30   .4995       .0005          .0017
3.31   .4995       .0005          .0017
3.32   .4995       .0005          .0016
3.33   .4996       .0004          .0016
3.34   .4996       .0004          .0015
3.35   .4996       .0004          .0015
3.36   .4996       .0004          .0014
3.37   .4996       .0004          .0014
3.38   .4996       .0004          .0013
3.39   .4997       .0003          .0013
3.40   .4997       .0003          .0012
3.41   .4997       .0003          .0012
3.42   .4997       .0003          .0012
3.43   .4997       .0003          .0011
3.44   .4997       .0003          .0011
3.45   .4997       .0003          .0010
3.46   .4997       .0003          .0010
3.47   .4997       .0003          .0010
3.48   .4997       .0003          .0009
3.49   .4998       .0002          .0009
3.50   .4998       .0002          .0009
3.51   .4998       .0002          .0008
3.52   .4998       .0002          .0008
3.53   .4998       .0002          .0008
3.54   .4998       .0002          .0008
3.55   .4998       .0002          .0007
3.56   .4998       .0002          .0007
3.57   .4998       .0002          .0007
3.58   .4998       .0002          .0007
3.59   .4998       .0002          .0006


Table A1 Table of the Normal Distribution (continued) 

z      p(μ to z)   p(z to tail)   ordinate
3.60   .4998       .0002          .0006
3.61   .4998       .0002          .0006
3.62   .4999       .0001          .0006
3.63   .4999       .0001          .0005
3.64   .4999       .0001          .0005
3.65   .4999       .0001          .0005
3.66   .4999       .0001          .0005
3.67   .4999       .0001          .0005
3.68   .4999       .0001          .0005
3.69   .4999       .0001          .0004
3.70   .4999       .0001          .0004
3.71   .4999       .0001          .0004
3.72   .4999       .0001          .0004
3.73   .4999       .0001          .0004
3.74   .4999       .0001          .0004
3.75   .4999       .0001          .0004
3.76   .4999       .0001          .0003
3.77   .4999       .0001          .0003
3.78   .4999       .0001          .0003
3.79   .4999       .0001          .0003


z      p(μ to z)   p(z to tail)   ordinate
3.80   .4999       .0001          .0003
3.81   .4999       .0001          .0003
3.82   .4999       .0001          .0003
3.83   .4999       .0001          .0003
3.84   .4999       .0001          .0003
3.85   .4999       .0001          .0002
3.86   .4999       .0001          .0002
3.87   .4999       .0001          .0002
3.88   .4999       .0001          .0002
3.89   .5000       .0000          .0002
3.90   .5000       .0000          .0002
3.91   .5000       .0000          .0002
3.92   .5000       .0000          .0002
3.93   .5000       .0000          .0002
3.94   .5000       .0000          .0002
3.95   .5000       .0000          .0002
3.96   .5000       .0000          .0002
3.97   .5000       .0000          .0002
3.98   .5000       .0000          .0001
3.99   .5000       .0000          .0001
4.00   .5000       .0000          .0001
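
The three columns of Table A1 can be recomputed for any nonnegative z. The following is a minimal sketch that uses only the Python standard library; the function name normal_table_row is an illustrative assumption, not something taken from the Handbook.

```python
import math

def normal_table_row(z):
    """Return (p(mu to z), p(z to tail), ordinate) for a standard normal z >= 0."""
    mean_to_z = 0.5 * math.erf(z / math.sqrt(2.0))   # area between the mean and z
    z_to_tail = 0.5 - mean_to_z                      # area beyond z in one tail
    ordinate = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # curve height at z
    return mean_to_z, z_to_tail, ordinate

# Print a few rows in the same layout as the table.
for z in (0.00, 1.00, 1.96, 2.58):
    area, tail, height = normal_table_row(z)
    print(f"{z:4.2f}  {area:.4f}  {tail:.4f}  {height:.4f}")
```

Rounded to four decimal places, the output should agree with the corresponding table rows; the first two columns of any row sum to .5000.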


Table A2 Table of Student's t Distribution 

Two-tailed   .80    .50    .20    .10    .05     .02     .01     .001 
One-tailed   .40    .25    .10    .05    .025    .01     .005    .0005 
p            .60    .75    .90    .95    .975    .99     .995    .9995 
df 
1     .325  1.000  3.078  6.314  12.706  31.821  63.657  636.619 
2     .289   .816  1.886  2.920   4.303   6.965   9.925   31.598 
3     .277   .765  1.638  2.353   3.182   4.541   5.841   12.924 
4     .271   .741  1.533  2.132   2.776   3.747   4.604    8.610 
5     .267   .727  1.476  2.015   2.571   3.365   4.032    6.869 
6     .265   .718  1.440  1.943   2.447   3.143   3.707    5.959 
7     .263   .711  1.415  1.895   2.365   2.998   3.499    5.408 
8     .262   .706  1.397  1.860   2.306   2.896   3.355    5.041 
9     .261   .703  1.383  1.833   2.262   2.821   3.250    4.781 
10    .260   .700  1.372  1.812   2.228   2.764   3.169    4.587 
11    .260   .697  1.363  1.796   2.201   2.718   3.106    4.437 
12    .259   .695  1.356  1.782   2.179   2.681   3.055    4.318 
13    .259   .694  1.350  1.771   2.160   2.650   3.012    4.221 
14    .258   .692  1.345  1.761   2.145   2.624   2.977    4.140 
15    .258   .691  1.341  1.753   2.131   2.602   2.947    4.073 
16    .258   .690  1.337  1.746   2.120   2.583   2.921    4.015 
17    .257   .689  1.333  1.740   2.110   2.567   2.898    3.965 
18    .257   .688  1.330  1.734   2.101   2.552   2.878    3.922 
19    .257   .688  1.328  1.729   2.093   2.539   2.861    3.883 
20    .257   .687  1.325  1.725   2.086   2.528   2.845    3.850 
21    .257   .686  1.323  1.721   2.080   2.518   2.831    3.819 
22    .256   .686  1.321  1.717   2.074   2.508   2.819    3.792 
23    .256   .685  1.319  1.714   2.069   2.500   2.807    3.767 
24    .256   .685  1.318  1.711   2.064   2.492   2.797    3.745 
25    .256   .684  1.316  1.708   2.060   2.485   2.787    3.725 
26    .256   .684  1.315  1.706   2.056   2.479   2.779    3.707 
27    .256   .684  1.314  1.703   2.052   2.473   2.771    3.690 
28    .256   .683  1.313  1.701   2.048   2.467   2.763    3.674 
29    .256   .683  1.311  1.699   2.045   2.462   2.756    3.659 
30    .256   .683  1.310  1.697   2.042   2.457   2.750    3.646 
40    .255   .681  1.303  1.684   2.021   2.423   2.704    3.551 
60    .254   .679  1.296  1.671   2.000   2.390   2.660    3.460 
120   .254   .677  1.289  1.658   1.980   2.358   2.617    3.373 
∞     .253   .674  1.282  1.645   1.960   2.326   2.576    3.291 
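
The last row of Table A2 (df = ∞) coincides with the standard normal distribution, so it can be checked without any t-distribution routine. The sketch below assumes Python 3.8 or later, where statistics.NormalDist is available; the variable names are illustrative only.

```python
from statistics import NormalDist

# Cumulative proportions heading the columns of Table A2 (the row labeled p).
p_levels = [0.60, 0.75, 0.90, 0.95, 0.975, 0.99, 0.995, 0.9995]

# With infinite degrees of freedom, t critical values equal normal quantiles.
infinite_df_row = [round(NormalDist().inv_cdf(p), 3) for p in p_levels]
print(infinite_df_row)
```

Rows for finite df require the t distribution itself, which the standard library does not provide; a statistics package would be needed for those.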
Table A3 Power Curves for Student's t Distribution 

Table A3-A (Two-Tailed .01 and One-Tailed .005 Values) 
Table A3-B (Two-Tailed .02 and One-Tailed .01 Values) 
Table A3-C (Two-Tailed .05 and One-Tailed .025 Values) 
Table A3-D (Two-Tailed .10 and One-Tailed .05 Values) 

[Graphs not reproduced. In each panel the vertical axis is Power, 100(1 - β), and the curves are labeled by f = degrees of freedom.] 

Table A4 Table of the Chi-Square Distribution 

p     .005      .010     .025     .050    .100    .900    .950    .975    .990    .995    .999 
df 
1   .0000393  .000157  .000982  .00393   .0158    2.71    3.84    5.02    6.63    7.88   10.83 
2     .0100    .0201    .0506    .103    .211     4.61    5.99    7.38    9.21   10.60   13.82 
3     .072     .115     .216     .352    .584     6.25    7.81    9.35   11.34   12.84   16.27 
4     .207     .297     .484     .711   1.064     7.78    9.49   11.14   13.28   14.86   18.47 
5     .412     .554     .831    1.145   1.61      9.24   11.07   12.83   15.09   16.75   20.52 
6     .676     .872    1.24     1.64    2.20     10.64   12.59   14.45   16.81   18.55   22.46 
7     .989    1.24     1.69     2.17    2.83     12.02   14.07   16.01   18.48   20.28   24.32 
8    1.34     1.65     2.18     2.73    3.49     13.36   15.51   17.53   20.09   21.96   26.13 
9    1.73     2.09     2.70     3.33    4.17     14.68   16.92   19.02   21.67   23.59   27.88 
10   2.16     2.56     3.25     3.94    4.87     15.99   18.31   20.48   23.21   25.19   29.59 
11   2.60     3.05     3.82     4.57    5.58     17.28   19.68   21.92   24.72   26.76   31.26 
12   3.07     3.57     4.40     5.23    6.30     18.55   21.03   23.34   26.22   28.30   32.91 
13   3.57     4.11     5.01     5.89    7.04     19.81   22.36   24.74   27.69   29.82   34.53 
14   4.07     4.66     5.63     6.57    7.79     21.06   23.68   26.12   29.14   31.32   36.12 
15   4.60     5.23     6.26     7.26    8.55     22.31   25.00   27.49   30.58   32.80   37.70 
16   5.14     5.81     6.91     7.96    9.31     23.54   26.30   28.85   32.00   34.27   39.25 
17   5.70     6.41     7.56     8.67   10.09     24.77   27.59   30.19   33.41   35.72   40.79 
18   6.26     7.01     8.23     9.39   10.86     25.99   28.87   31.53   34.81   37.16   42.31 
19   6.84     7.63     8.91    10.12   11.65     27.20   30.14   32.85   36.19   38.58   43.82 
20   7.43     8.26     9.59    10.85   12.44     28.41   31.41   34.17   37.57   40.00   45.32 
21   8.03     8.90    10.28    11.59   13.24     29.62   32.67   35.48   38.93   41.40   46.80 
22   8.64     9.54    10.98    12.34   14.04     30.81   33.92   36.78   40.29   42.80   48.27 
23   9.26    10.20    11.69    13.09   14.85     32.01   35.17   38.08   41.64   44.18   49.73 
24   9.89    10.86    12.40    13.85   15.66     33.20   36.42   39.36   42.98   45.56   51.18 
25  10.52    11.52    13.12    14.61   16.47     34.38   37.65   40.65   44.31   46.93   52.62 
26  11.16    12.20    13.84    15.38   17.29     35.56   38.89   41.92   45.64   48.29   54.05 
27  11.81    12.88    14.57    16.15   18.11     36.74   40.11   43.19   46.96   49.64   55.48 
28  12.46    13.56    15.31    16.93   18.94     37.92   41.34   44.46   48.28   50.99   56.89 
29  13.12    14.26    16.05    17.71   19.77     39.09   42.56   45.72   49.59   52.34   58.30 
30  13.79    14.95    16.79    18.49   20.60     40.26   43.77   46.98   50.89   53.67   59.70 
40  20.71    22.16    24.43    26.51   29.05     51.80   55.76   59.34   63.69   66.77   73.40 
50  27.99    29.71    32.36    34.76   37.69     63.17   67.50   71.42   76.15   79.49   86.66 
60  35.53    37.48    40.48    43.19   46.46     74.40   79.08   83.30   88.38   91.95   99.61 
70  43.28    45.44    48.76    51.74   55.33     85.53   90.53   95.02  100.43  104.22  112.32 
80  51.17    53.54    57.15    60.39   64.28     96.58  101.88  106.63  112.33  116.32  124.84 
90  59.20    61.75    65.65    69.13   73.29    107.56  113.15  118.14  124.12  128.30  137.21 
100 67.33    70.06    74.22    77.93   82.36    118.50  124.34  129.56  135.81  140.17  149.45 
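
Some entries of Table A4 can be checked with closed-form expressions. The sketch below is an illustration under stated assumptions, not the method used to build the published table: for df = 1 the chi-square quantile is the square of a normal quantile, for df = 2 it is -2 ln(1 - p), and for larger df the code falls back on the Wilson-Hilferty approximation, which is close but not exact. The function name chi_square_quantile is hypothetical.

```python
import math
from statistics import NormalDist

def chi_square_quantile(p, df):
    """Chi-square value whose cumulative probability is p.

    Exact for df = 1 and df = 2; Wilson-Hilferty approximation otherwise.
    """
    if df == 1:
        # A chi-square variable with 1 df is a squared standard normal variable.
        z = NormalDist().inv_cdf((1.0 + p) / 2.0)
        return z * z
    if df == 2:
        # With 2 df the distribution is exponential with mean 2.
        return -2.0 * math.log(1.0 - p)
    # Wilson-Hilferty: the cube root of chi-square/df is approximately normal.
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3

print(round(chi_square_quantile(0.95, 1), 2))
print(round(chi_square_quantile(0.95, 2), 2))
print(round(chi_square_quantile(0.95, 100), 2))
```

The approximation is least accurate for small df and extreme values of p, so exact routines are preferable whenever a statistics library is available.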


Table A5 Table of Critical T Values for Wilcoxon's Signed-Ranks 
and Matched-Pairs Signed-Ranks Test 

      One-tailed level of significance         One-tailed level of significance 
       .05    .025   .01    .005                .05    .025   .01    .005 
      Two-tailed level of significance         Two-tailed level of significance 
       .10    .05    .02    .01                 .10    .05    .02    .01 
n                                         n 
5       0      -      -      -            28    130    116    101     91 
6       2      0      -      -            29    140    126    110    100 
7       3      2      0      -            30    151    137    120    109 
8       5      3      1      0            31    163    147    130    118 
9       8      5      3      1            32    175    159    140    128 
10     10      8      5      3            33    187    170    151    138 
11     13     10      7      5            34    200    182    162    148 
12     17     13      9      7            35    213    195    173    159 
13     21     17     12      9            36    227    208    185    171 
14     25     21     15     12            37    241    221    198    182 
15     30     25     19     15            38    256    235    211    194 
16     35     29     23     19            39    271    249    224    207 
17     41     34     27     23            40    286    264    238    220 
18     47     40     32     27            41    302    279    252    233 
19     53     46     37     32            42    319    294    266    247 
20     60     52     43     37            43    336    310    281    261 
21     67     58     49     42            44    353    327    296    276 
22     75     65     55     48            45    371    343    312    291 
23     83     73     62     54            46    389    361    328    307 
24     91     81     69     61            47    407    378    345    322 
25    100     89     76     68            48    426    396    362    339 
26    110     98     84     75            49    446    415    379    355 
27    119    107     92     83            50    466    434    397    373 
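
For small n, critical T values of this kind can be derived by enumerating the exact null distribution of the signed-ranks statistic. The sketch below assumes no ties and no zero differences, and takes the critical value to be the largest T whose lower-tail probability does not exceed the one-tailed significance level; that convention and the function name wilcoxon_critical_t are assumptions made for illustration, and the published table may differ for some entries.

```python
def wilcoxon_critical_t(n, alpha_one_tailed):
    """Largest T with P(T <= t) <= alpha under the null hypothesis, else None."""
    max_sum = n * (n + 1) // 2
    # counts[s] = number of subsets of the ranks 1..n whose sum equals s.
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # Iterate downward so each rank is used at most once per subset.
        for total in range(max_sum, rank - 1, -1):
            counts[total] += counts[total - rank]
    total_outcomes = 2 ** n   # each rank is positive or negative with equal chance
    cumulative = 0
    critical = None
    for t, count in enumerate(counts):
        cumulative += count
        if cumulative / total_outcomes <= alpha_one_tailed:
            critical = t
        else:
            break
    return critical

print(wilcoxon_critical_t(10, 0.05), wilcoxon_critical_t(10, 0.025))
```

A return value of None corresponds to a dash in the table: no value of T is small enough to reach that significance level at that sample size.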
Table A6 Table of the Binomial Distribution, Individual Probabilities 





© 2000 by Chapman & Hall/CRC 


Table A6 Table of the Binomial Distribution, Individual Probabilities (continued) 

n    x    .05    .10    .15    .20    .25    .30    .35    .40    .45    .50 
9    0   .6302  .3874  .2316  .1342  .0751  .0404  .0207  .0101  .0046  .0020 
     1   .2985  .3874  .3679  .3020  .2253  .1556  .1004  .0605  .0339  .0176 
     2   .0629  .1722  .2597  .3020  .3003  .2668  .2162  .1612  .1110  .0703 
     3   .0077  .0446  .1069  .1762  .2336  .2668  .2716  .2508  .2119  .1641 
     4   .0006  .0074  .0283  .0661  .1168  .1715  .2194  .2508  .2600  .2461 
     5   .0000  .0008  .0050  .0165  .0389  .0735  .1181  .1672  .2128  .2461 
     6   .0000  .0001  .0006  .0028  .0087  .0210  .0424  .0743  .1160  .1641 
     7   .0000  .0000  .0000  .0003  .0012  .0039  .0098  .0212  .0407  .0703 
     8   .0000  .0000  .0000  .0000  .0001  .0004  .0013  .0035  .0083  .0176 
     9   .0000  .0000  .0000  .0000  .0000  .0000  .0001  .0003  .0008  .0020 
10   0   .5987  .3487  .1969  .1074  .0563  .0282  .0135  .0060  .0025  .0010 
     1   .3151  .3874  .3474  .2684  .1877  .1211  .0725  .0403  .0207  .0098 
     2   .0746  .1937  .2759  .3020  .2816  .2335  .1757  .1209  .0763  .0439 
     3   .0105  .0574  .1298  .2013  .2503  .2668  .2522  .2150  .1665  .1172 
     4   .0010  .0112  .0401  .0881  .1460  .2001  .2377  .2508  .2384  .2051 
     5   .0001  .0015  .0085  .0264  .0584  .1029  .1536  .2007  .2340  .2461 
     6   .0000  .0001  .0012  .0055  .0162  .0368  .0689  .1115  .1596  .2051 
     7   .0000  .0000  .0001  .0008  .0031  .0090  .0212  .0425  .0746  .1172 
     8   .0000  .0000  .0000  .0001  .0004  .0014  .0043  .0106  .0229  .0439 
     9   .0000  .0000  .0000  .0000  .0000  .0001  .0005  .0016  .0042  .0098 
     10  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0001  .0003  .0010 
11   0   .5688  .3138  .1673  .0859  .0422  .0198  .0088  .0036  .0014  .0005 
     1   .3293  .3835  .3248  .2362  .1549  .0932  .0518  .0266  .0125  .0054 
     2   .0867  .2131  .2866  .2953  .2581  .1998  .1395  .0887  .0513  .0269 
     3   .0137  .0710  .1517  .2215  .2581  .2568  .2254  .1774  .1259  .0806 
     4   .0014  .0158  .0536  .1107  .1721  .2201  .2428  .2365  .2060  .1611 
     5   .0001  .0025  .0132  .0388  .0803  .1321  .1830  .2207  .2360  .2256 
     6   .0000  .0003  .0023  .0097  .0268  .0566  .0985  .1471  .1931  .2256 
     7   .0000  .0000  .0003  .0017  .0064  .0173  .0379  .0701  .1128  .1611 
     8   .0000  .0000  .0000  .0002  .0011  .0037  .0102  .0234  .0462  .0806 
     9   .0000  .0000  .0000  .0000  .0001  .0005  .0018  .0052  .0126  .0269 
     10  .0000  .0000  .0000  .0000  .0000  .0000  .0002  .0007  .0021  .0054 
     11  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0002  .0005 
12   0   .5404  .2824  .1422  .0687  .0317  .0138  .0057  .0022  .0008  .0002 
     1   .3413  .3766  .3012  .2062  .1267  .0712  .0368  .0174  .0075  .0029 
     2   .0988  .2301  .2924  .2835  .2323  .1678  .1088  .0639  .0339  .0161 
     3   .0173  .0852  .1720  .2362  .2581  .2397  .1954  .1419  .0923  .0537 
     4   .0021  .0213  .0683  .1329  .1936  .2311  .2367  .2128  .1700  .1208 
     5   .0002  .0038  .0193  .0532  .1032  .1585  .2039  .2270  .2225  .1934 
     6   .0000  .0005  .0040  .0155  .0401  .0792  .1281  .1766  .2124  .2256 
     7   .0000  .0000  .0006  .0033  .0115  .0291  .0591  .1009  .1489  .1934 
     8   .0000  .0000  .0001  .0005  .0024  .0078  .0199  .0420  .0762  .1208 
     9   .0000  .0000  .0000  .0001  .0004  .0015  .0048  .0125  .0277  .0537 
     10  .0000  .0000  .0000  .0000  .0000  .0002  .0008  .0025  .0068  .0161 
     11  .0000  .0000  .0000  .0000  .0000  .0000  .0001  .0003  .0010  .0029 
     12  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0000  .0001  .0002 
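
Each entry of Table A6 is an individual binomial probability and can be recomputed directly. The sketch below uses only the Python standard library (math.comb requires Python 3.8 or later); the function name binomial_pmf and the argument name pi are illustrative choices.

```python
from math import comb

def binomial_pmf(x, n, pi):
    """Probability of exactly x successes in n trials with success probability pi."""
    return comb(n, x) * pi ** x * (1.0 - pi) ** (n - x)

# Recompute a few table entries, rounded to four decimal places.
print(round(binomial_pmf(5, 10, 0.50), 4))
print(round(binomial_pmf(0, 9, 0.05), 4))
print(round(binomial_pmf(3, 10, 0.30), 4))
```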
Table A7 Table of the Binomial Distribution, Cumulative Probabilities 





a 


re) 
eel" 
3 . 
- 
a 


e 
C3 N — 
T 
$95 
"- o 


me WN — 
on 


. 2262 
.0226 
.0012 


C» » CJ) t3 »- 


.2649 
.0328 


AUNE URS NANNE oO NAOUN 


oo -10 


.3698 
.0712 


C mW t9 — 
"m 2 
Ps 


«000-10 


.4686 
.1143 
.0158 
.0013 
.0001 


. 5695 
. 1869 


.6126 


.6794 


.0738 
.0121 
.0012 


.0001 


.721$ 
.3428 
.1052 
.0214 
.0029 


.7684 
.4005 
.1409 
.0339 


.0006 


. 5904 
- 1808 
.0272 
.0016 


.0723 
.2027 
.0579 
. .0067 


. 7379 
.3447 
.0989 
.0170 
.0016 


.0001 


.8322 
.4967 
.2031 
.0563 
.0104 
.0012 
.0001 
. 8658 
-2618 
.0196 


.0031 


- 7627 


“1035 
0156 
‘0010 


.8220 
.4661 
.1694 
.0376 


.8685 
.9551 
.2436 
.0706 
.0129 


. 0013 
.0001 


.8099 


.3215 
.1138 
.0273 


.9249 
.6997 
.3993 
.1657 


.0100 
.0013 
.0001 


.8319 
.4718 
'.1631 


.0024 


.8824 
.5798 
.2557 
.0705 
.0109 


.9176 
.6706 
.3529 
.1260 


.0038 
.9424 


. 7447 


.1941 
.0580 


.0113 
.0013 
.0001 
.9596 
. 5372 
.2703 


.0253 
.0043 


.0018 


.9510 
.7002 


"1998 
0556 


.9793 
.8789 


.3911 
.1717 


.0536 
.0112 
.0014 
.0001 


.1600 


. 7840 
.3520 
.0640 


.8704 
.5248 
.1792 
.0256 


.9222 


.3174 
.0870 
.0102 


.9533 
.7667 


.1792 
.0410 


.0041 


.9720 
.8414 


.2898 
.0963 


.0188 
.0016 


.9832 
.8936 


.4059 
.1737 


.0498 
.0085 
.0007 


.9899 
.9295 
.7682 
.5174 
.2066 


.0994 
.0250 
.0038 


.0083 


.9848 
.8976 


.3917 
.1529 


.0357 
.0037 


.9916 
.9368 
.7799 
.5230 
.2604 


.0885 
.0181 
.0017 


.9954 
.9015 
.8505 
.6386 
.3786 


— — — —— — —— — —— es, — 


.8750 
.1250 


.9375 
.6875 
.3125 
.0625 


.9688 
.8125 


.1875 
.0312 


.9844 


.6562 
13438 
1094 


.0156 


.9922 
.9375 
.7134 


.2268 


.0625 
.0078 


.9961 
.9648 
.8555 
.6367 
.3633 


.1445 
.0352 
.0039 


.9980 
.9805 
.9102 
.7401 


.2539 
.0898 
.0195 
.0020 
Table A7 Table of the Binomial Distribution, Cumulative Probabilities (continued) 





a 

^0 voeO ous 
re 
ba 
tn 


= 
O «o o 


.4312 
.1019 
.0152 
.0016 
.0001 


11 


Cn od C9 t2 — 





SCO 0 -10 


m 


12 1 .4596 
2 .1184 
3 .0196 
4 .0022 
5 


.0002 


13 1 .4867 
2 .1354 
3 .0245 
4 .0031 
$ 


.0003 





.6862 
-3026 
.0896 
.0185 


.7176 
.3410 
.1109 
.0256 


.7458 
.3787 
.1339 
.0342 
.0085 
.8327 
. 5078 
.2212 
.0694 
.0159 


.8791 
-8017 


.9313 
.7251 
4417 
. 2054 
.0726 


.0194 
.0039 


.0001 


.9450 
- 7864 
.4983 
.2527 
.0991 


.0300 
.0070 
.0012 
.0002 
.0000 


.9683 
.8416 
.6093 
.3512 
.1576 


.0544 
.0143 
.0028 
.0004 
.0000 


.0000 


.9762 
.8733 
.6674 
(4157 
.2060 


.0802 
.0243 
.0056 
.0010 
.0001 


0000 


.0000 


.9862 
.9150 
.7472 
.5075 
.2763 


.1178 
.0386 
.0095 
.0017 


.0000 


.9903 
.9363 
.7975 
.9794 
.3457 


.1654 
.0624 
.0182 
.0040 
.0007 


0001 
0000 


.0000 


. 9943 
.9576 
.8487 
.6533 
.4167 


.2127 
.0846 
.0255 
.0056 
.0008 


.0001 


.9963 
.9704 


.7217 
.4995 


.2841 
.1295 
.0462 
.0126 
.0028 


0003 


.9978 
.9804 
.9166 
. 7747 
.9618 


.3348 
.1582 
.0573 
.0153 
.0028 


.0003 


.9987 
.9874 
.9421 
.8314 
.6470 


.4256 
.2288 
.0977 
.0321 
.0078 


0013 
0001 
0000 


.9992 
.9917 
.9579 
.8655 
.6956 


.4731 
.2607 
.1117 
.0356 
.0079 


.0011 
.0001 


.9996 
.9951 
.9731 
.9071 
.7721 


.5732 
.3563 
.1788 
.0698 
.9203 


0041 
0005 







. 5000 
. 2744 
.1133 
.0327 
.0059 


.0005 


. 9998 
. 9968 
. 9807 
.9270 
.8062 


.6128 
.3872 
.1938 
.0730 
.0193 


.0032 
.0002 


.9999 
.9983 
.9888 
.9539 
.8666 


.7095 
.5000 
.2905 
.1334 
.0461 


0112 
0017 
0001 


Table A7 Table of the Binomial Distribution, Cumulative Probabilities (continued) 


 n   x     .05     .10     .15     .20     .25     .30     .35     .40     .45     .50
14   1   .5123   .7712   .8972   .9560   .9822   .9932   .9976   .9992   .9998   .9999
     2   .1530   .4154   .6433   .8021   .8990   .9525   .9795   .9919   .9971   .9991
     3   .0301   .1584   .3521   .5519   .7189   .8392   .9161   .9602   .9830   .9935
     4   .0042   .0441   .1465   .3018   .4787   .6448   .7795   .8757   .9368   .9713
     5   .0004   .0092   .0467   .1298   .2585   .4158   .5773   .7207   .8328   .9102
     6   .0000   .0015   .0115   .0439   .1117   .2195   .3595   .5141   .6627   .7880
     7   .0000   .0002   .0022   .0116   .0383   .0933   .1836   .3075   .4539   .6047
     8   .0000   .0000   .0003   .0024   .0103   .0315   .0753   .1501   .2586   .3953
     9   .0000   .0000   .0000   .0004   .0022   .0083   .0243   .0583   .1189   .2120
    10   .0000   .0000   .0000   .0000   .0003   .0017   .0060   .0175   .0426   .0898
    11   .0000   .0000   .0000   .0000   .0000   .0002   .0011   .0039   .0114   .0287
    12   .0000   .0000   .0000   .0000   .0000   .0000   .0001   .0006   .0022   .0065
    13   .0000   .0000   .0000   .0000   .0000   .0000   .0000   .0001   .0003   .0009
    14   .0000   .0000   .0000   .0000   .0000   .0000   .0000   .0000   .0000   .0001

15   1   .5367   .7941   .9126   .9648   .9866   .9953   .9984   .9995   .9999  1.0000
     2   .1710   .4510   .6814   .8329   .9198   .9647   .9858   .9948   .9983   .9995
     3   .0362   .1841   .3958   .6020   .7639   .8732   .9383   .9729   .9893   .9963
     4   .0055   .0556   .1773   .3518   .5387   .7031   .8273   .9095   .9576   .9824
     5   .0006   .0127   .0617   .1642   .3135   .4845   .6481   .7827   .8796   .9408
     6   .0001   .0022   .0168   .0611   .1484   .2784   .4357   .5968   .7392   .8491
     7   .0000   .0003   .0036   .0181   .0566   .1311   .2452   .3902   .5478   .6964
     8   .0000   .0000   .0006   .0042   .0173   .0500   .1132   .2131   .3465   .5000
     9   .0000   .0000   .0001   .0008   .0042   .0152   .0422   .0950   .1818   .3036
    10   .0000   .0000   .0000   .0001   .0008   .0037   .0124   .0338   .0769   .1509
    11   .0000   .0000   .0000   .0000   .0001   .0007   .0028   .0093   .0255   .0592
    12   .0000   .0000   .0000   .0000   .0000   .0001   .0005   .0019   .0063   .0176
    13   .0000   .0000   .0000   .0000   .0000   .0000   .0001   .0003   .0011   .0037
    14   .0000   .0000   .0000   .0000   .0000   .0000   .0000   .0000   .0001   .0005
    15   .0000   .0000   .0000   .0000   .0000   .0000   .0000   .0000   .0000   .0000
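
The entries for n = 14 and n = 15 appear to be upper-tail probabilities P(X ≥ x); for instance, the n = 14, x = 1, .05 entry equals 1 - .95^14 ≈ .5123. Under that assumption, a minimal Python sketch for spot-checking an entry is given here. The function name `upper_tail` is illustrative and is not taken from the Handbook.

```python
from math import comb


def upper_tail(n: int, x: int, p: float) -> float:
    """Return P(X >= x) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


# Spot checks against the n = 14 and n = 15 rows of the table
print(f"{upper_tail(14, 1, 0.05):.4f}")
print(f"{upper_tail(15, 8, 0.50):.4f}")
```

Because the sum is exact up to floating-point rounding, a disagreement in the fourth decimal place is more likely to reflect a scanning error in the printed digits than a real difference.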
Table A8 Table of Critical Values for the Single-Sample Runs Test 


Numbers listed are tabled critical two-tailed .05 and one-tailed .025 values. 









































n₂ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
n₁ 
2 2 2 2 2 2 2 2 2 2 
3 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 
4 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 
9 9 
5 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 
9 10 10 11 11 
6 2 2 3 3 3 3 4 4 4 4 5 5 5 5 5 5 6 6 
— 9 10 11 12 12 13 13 13 13 
7 2 2 3 3 3 4 4 5 5 5 5 5 6 6 6 6 6 6 
- — ] 12 13 13 14 14 14 14 15 15 15 
8 2 3 3 3 4 4 5 5 5 6 6 6 6 6 7 7 7 7 
—- — ] 12 13 14 14 15 15 16 16 16 16 17 17 17 17 17 
9 2 3 3 4 4 5 5 5 6 6 6 7 7 7 7 8 8 8 
- — - 13 14 14 15 16 16 16 17 17 18 18 18 18 18 18 
10 2 3 3 4 5 5 5 6 6 7 7 7 7 8 8 8 8 9 
—- — - 13 14 15 16 16 17 17 18 18 18 19 19 19 20 20 
11 2 3 4 4 5 5 6 6 7 7 7 8 8 8 9 9 9 9 
- — — 13 14 15 16 17 17 18 19 19 19 20 20 20 21 21 
12 2 2 3 4 4 5 6 6 7 7 7 8 8 8 9 9 9 10 10 
- - - = 13 14 16 16 17 18 19 19 20 20 21 21 21 22 22 
13 2 2 3 4 5 5 6 6 7 7 8 8 9 9 9 10 10 10 10 
15 16 17 18 19 19 20 20 21 21 22 22 23 23 
14 2 2 3 4 5 5 6 7 7 8 8 9 9 9 10 10 10 11 11 
15 16 17 18 19 20 20 21 22 22 23 23 23 24 
15 2 3 3 4 5 6 6 7 7 8 8 9 9 10 10 11 11 11 12 
15 16 18 18 19 20 21 22 22 23 23 24 24 25 
16 2 3 4 4 5 6 6 7 8 8 9 9 10 10 11 11 11 12 12 
17 18 19 20 21 21 22 23 23 24 25 25 25 
17 2 3 4 4 5 6 7 7 8 9 9 10 10 11 11 11 12 12 13 
17 18 19 20 21 22 23 23 24 25 25 26 26 
18 2 3 4 5 5 6 7 8 8 9 9 10 10 11 11 12 12 13 13 
17 18 19 20 21 22 23 24 25 25 26 26 27 
19 2 3 4 5 6 6 7 8 8 9 10 10 11 11 12 12 13 13 13 
17 18 20 21 22 23 23 24 25 26 26 27 27 
20 2 3 4 5 6 6 7 8 9 9 10 10 11 12 12 13 13 13 14 
17 18 20 21 22 23 24 25 25 26 27 27 28 
Table A9 Table of the F max Distribution 


The .05 critical values are in lightface type, and the .01 critical values are in bold type. 


n-1    α     k=2      3      4      5      6      7      8      9     10     11     12
  2   .05   39.0   87.5    142    202    266    333    403    475    550    626    704
      .01    199    448    729   1036   1362   1705   2063   2432   2813   3204   3605
  3   .05   15.4   27.8   39.2   50.7   62.0   72.9   83.5   93.9    104    114    124
      .01   47.5     85    120    151    184   216*   249*   281*   310*   337*   361*
  4   .05   9.60   15.5   20.6   25.2   29.5   33.6   37.5   41.4   44.6   48.0   51.4
      .01   23.2     37     49     59     69     79     89     97    106    113    120
  5   .05   7.15   10.8   13.7   16.3   18.7   20.8   22.9   24.7   26.5   28.2   29.9
      .01   14.9     22     28     33     38     42     46     50     54     57     60
  6   .05   5.82   8.38   10.4   12.1   13.7   15.0   16.3   17.5   18.6   19.7   20.7
      .01   11.1   15.5   19.1     22     25     27     30     32     34     36     37
  7   .05   4.99   6.94   8.44   9.70   10.8   11.8   12.7   13.5   14.3   15.1   15.8
      .01   8.89   12.1   14.5   16.5   18.4     20     22     23     24     26     27
  8   .05   4.43   6.00   7.18   8.12   9.03   9.78   10.5   11.1   11.7   12.2   12.7
      .01   7.50    9.9   11.7   13.2   14.5   15.8   16.9   17.9   18.9   19.8     21
  9   .05   4.03   5.34   6.31   7.11   7.80   8.41   8.95   9.45   9.91   10.3   10.7
      .01   6.54    8.5    9.9   11.1   12.1   13.1   13.9   14.7   15.3   16.0   16.6
 10   .05   3.72   4.85   5.67   6.34   6.92   7.42   7.87   8.28   8.66   9.01   9.34
      .01   5.85    7.4    8.6    9.6   10.4   11.1   11.8   12.4   12.9   13.4   13.9
 12   .05   3.28   4.16   4.79   5.30   5.72   6.09   6.42   6.72   7.00   7.25   7.48
      .01   4.91    6.1    6.9    7.6    8.2    8.7    9.1    9.5    9.9   10.2   10.6
 15   .05   2.86   3.54   4.01   4.37   4.68   4.95   5.19   5.40   5.59   5.77   5.93
      .01   4.07    4.9    5.5    6.0    6.4    6.7    7.1    7.3    7.5    7.8    8.0
 20   .05   2.46   2.95   3.29   3.54   3.76   3.94   4.10   4.24   4.37   4.49   4.59
      .01   3.32    3.8    4.3    4.6    4.9    5.1    5.3    5.5    5.6    5.8    5.9
 30   .05   2.07   2.40   2.61   2.78   2.91   3.02   3.12   3.21   3.29   3.36   3.39
      .01   2.63    3.0    3.3    3.5    3.6    3.7    3.8    3.9    4.0    4.1    4.2
 60   .05   1.67   1.85   1.96   2.04   2.11   2.17   2.22   2.26   2.30   2.33   2.36
      .01   1.96    2.2    2.3    2.4    2.4    2.5    2.5    2.6    2.6    2.7    2.7


* The third digit in these values is an approximation 
Table A10 Table of the F Distribution 
Table A11 Table of Critical Values for Mann-Whitney U Statistic 


(Two-Tailed .05 Values) 
Table A11 Table of Critical Values for Mann-Whitney U Statistic (continued) 


(Two-Tailed .01 Values) 
Table A12 Table of Sandler's A Statistic 


             One-tailed level of significance
             .05      .025      .01      .005     .0005

             Two-tailed level of significance
df = n-1     .10      .05       .02      .01      .001
    1      .5125    .5031    .50049   .50012  .5000012
    2       .412     .369     .347     .340     .334
    3       .385     .324     .286     .272     .254
    4       .376     .304     .257     .238     .211
    5       .372     .293     .240     .218     .184
    6       .370     .286     .230     .205     .167
    7       .369     .281     .222     .196     .155
    8       .368     .278     .217     .190     .146
    9       .368     .276     .213     .185     .139
   10       .368     .274     .210     .181     .134
   11       .368     .273     .207     .178     .130
   12       .368     .271     .205     .176     .126
   13       .368     .270     .204     .174     .124
   14       .368     .270     .202     .172     .121
   15       .368     .269     .201     .170     .119
   16       .368     .268     .200     .169     .117
   17       .368     .268     .199     .168     .116
   18       .368     .267     .198     .167     .114
   19       .368     .267     .197     .166     .113
   20       .368     .266     .197     .165     .112
   21       .368     .266     .196     .165     .111
   22       .368     .266     .196     .164     .110
   23       .368     .266     .195     .163     .109
   24       .368     .265     .195     .163     .108
   25       .368     .265     .194     .162     .108
   26       .368     .265     .194     .162     .107
   27       .368     .265     .193     .161     .107
   28       .368     .265     .193     .161     .106
   29       .368     .264     .193     .161     .106
   30       .368     .264     .193     .160     .105
   40       .368     .263     .191     .158     .102
   60       .369     .262     .189     .155     .099
  120       .369     .261     .187     .153     .095
    ∞       .370     .260     .185     .151     .092
Table A13 Table of the Studentized Range Statistic 
Table A14 Table of Dunnett's Modified t Statistic for a Control Group Comparison 


The .05 critical values are in lightface type, and the .01 critical values are in bold type. 


Two-Tailed Values 

k = number of treatment means, including control 

df, error    α       2      3      4      5      6      7      8      9     10
    5       .05    2.57   3.03   3.29   3.48   3.62   3.73   3.82   3.90   3.97
            .01    4.03   4.63   4.98   5.22   5.41   5.56   5.69   5.80   5.89
    6       .05    2.45   2.86   3.10   3.26   3.39   3.49   3.57   3.64   3.71
            .01    3.71   4.21   4.51   4.71   4.87   5.00   5.10   5.20   5.28
    7       .05    2.36   2.75   2.97   3.12   3.24   3.33   3.41   3.47   3.53
            .01    3.50   3.95   4.21   4.39   4.53   4.64   4.74   4.82   4.89
    8       .05    2.31   2.67   2.88   3.02   3.13   3.22   3.29   3.35   3.41
            .01    3.36   3.77   4.00   4.17   4.29   4.40   4.48   4.56   4.62
    9       .05    2.26   2.61   2.81   2.95   3.05   3.14   3.20   3.26   3.32
            .01    3.25   3.63   3.85   4.01   4.12   4.22   4.30   4.37   4.43
   10       .05    2.23   2.57   2.76   2.89   2.99   3.07   3.14   3.19   3.24
            .01    3.17   3.53   3.74   3.88   3.99   4.08   4.16   4.22   4.28
   11       .05    2.20   2.53   2.72   2.84   2.94   3.02   3.08   3.14   3.19
            .01    3.11   3.45   3.65   3.79   3.89   3.98   4.05   4.11   4.16
   12       .05    2.18   2.50   2.68   2.81   2.90   2.98   3.04   3.09   3.14
            .01    3.05   3.39   3.58   3.71   3.81   3.89   3.96   4.02   4.07
   13       .05    2.16   2.48   2.65   2.78   2.87   2.94   3.00   3.06   3.10
            .01    3.01   3.33   3.52   3.65   3.74   3.82   3.89   3.94   3.99
   14       .05    2.14   2.46   2.63   2.75   2.84   2.91   2.97   3.02   3.07
            .01    2.98   3.29   3.47   3.59   3.69   3.76   3.83   3.88   3.93
   15       .05    2.13   2.44   2.61   2.73   2.82   2.89   2.95   3.00   3.04
            .01    2.95   3.25   3.43   3.55   3.64   3.71   3.78   3.83   3.88
   16       .05    2.12   2.42   2.59   2.71   2.80   2.87   2.92   2.97   3.02
            .01    2.92   3.22   3.39   3.51   3.60   3.67   3.73   3.78   3.83
   17       .05    2.11   2.41   2.58   2.69   2.78   2.85   2.90   2.95   3.00
            .01    2.90   3.19   3.36   3.47   3.56   3.63   3.69   3.74   3.79
   18       .05    2.10   2.40   2.56   2.68   2.76   2.83   2.89   2.94   2.98
            .01    2.88   3.17   3.33   3.44   3.53   3.60   3.66   3.71   3.75
   19       .05    2.09   2.39   2.55   2.66   2.75   2.81   2.87   2.92   2.96
            .01    2.86   3.15   3.31   3.42   3.50   3.57   3.63   3.68   3.72
   20       .05    2.09   2.38   2.54   2.65   2.73   2.80   2.86   2.90   2.95
            .01    2.85   3.13   3.29   3.40   3.48   3.55   3.60   3.65   3.69
   24       .05    2.06   2.35   2.51   2.61   2.70   2.76   2.81   2.86   2.90
            .01    2.80   3.07   3.22   3.32   3.40   3.47   3.52   3.57   3.61
   30       .05    2.04   2.32   2.47   2.58   2.66   2.72   2.77   2.82   2.86
            .01    2.75   3.01   3.15   3.25   3.33   3.39   3.44   3.49   3.52
   40       .05    2.02   2.29   2.44   2.54   2.62   2.68   2.73   2.77   2.81
            .01    2.70   2.95   3.09   3.19   3.26   3.32   3.37   3.41   3.44
   60       .05    2.00   2.27   2.41   2.51   2.58   2.64   2.69   2.73   2.77
            .01    2.66   2.90   3.03   3.12   3.19   3.25   3.29   3.33   3.37
  120       .05    1.98   2.24   2.38   2.47   2.55   2.60   2.65   2.69   2.73
            .01    2.62   2.85   2.97   3.06   3.12   3.18   3.22   3.26   3.29
    ∞       .05    1.96   2.21   2.35   2.44   2.51   2.57   2.61   2.65   2.69
            .01    2.58   2.79   2.92   3.00   3.06   3.11   3.15   3.19   3.22


Table A14 Table of Dunnett's Modified t Statistic for a Control Group Comparison 
(continued) 


One-Tailed Values 

k = number of treatment means, including control 

df, error    α       2      3      4      5      6      7      8      9     10
    5       .05    2.02   2.44   2.68   2.85   2.98   3.08   3.16   3.24   3.30
            .01    3.37   3.90   4.21   4.43   4.60   4.73   4.85   4.94   5.03
    6       .05    1.94   2.34   2.56   2.71   2.83   2.92   3.00   3.07   3.12
            .01    3.14   3.61   3.88   4.07   4.21   4.33   4.43   4.51   4.59
    7       .05    1.89   2.27   2.48   2.62   2.73   2.82   2.89   2.95   3.01
            .01    3.00   3.42   3.66   3.83   3.96   4.07   4.15   4.23   4.30
    8       .05    1.86   2.22   2.42   2.55   2.66   2.74   2.81   2.87   2.92
            .01    2.90   3.29   3.51   3.67   3.79   3.88   3.96   4.03   4.09
    9       .05    1.83   2.18   2.37   2.50   2.60   2.68   2.75   2.81   2.86
            .01    2.82   3.19   3.40   3.55   3.66   3.75   3.82   3.89   3.94
   10       .05    1.81   2.15   2.34   2.47   2.56   2.64   2.70   2.76   2.81
            .01    2.76   3.11   3.31   3.45   3.56   3.64   3.71   3.78   3.83
   11       .05    1.80   2.13   2.31   2.44   2.53   2.60   2.67   2.72   2.77
            .01    2.72   3.06   3.25   3.38   3.48   3.56   3.63   3.69   3.74
   12       .05    1.78   2.11   2.29   2.41   2.50   2.58   2.64   2.69   2.74
            .01    2.68   3.01   3.19   3.32   3.42   3.50   3.56   3.62   3.67
   13       .05    1.77   2.09   2.27   2.39   2.48   2.55   2.61   2.66   2.71
            .01    2.65   2.97   3.15   3.27   3.37   3.44   3.51   3.56   3.61
   14       .05    1.76   2.08   2.25   2.37   2.46   2.53   2.59   2.64   2.69
            .01    2.62   2.94   3.11   3.23   3.32   3.40   3.46   3.51   3.56
   15       .05    1.75   2.07   2.24   2.36   2.44   2.51   2.57   2.62   2.67
            .01    2.60   2.91   3.08   3.20   3.29   3.36   3.42   3.47   3.52
   16       .05    1.75   2.06   2.23   2.34   2.43   2.50   2.56   2.61   2.65
            .01    2.58   2.88   3.05   3.17   3.26   3.33   3.39   3.44   3.48
   17       .05    1.74   2.05   2.22   2.33   2.42   2.49   2.54   2.59   2.64
            .01    2.57   2.86   3.03   3.14   3.23   3.30   3.36   3.41   3.45
   18       .05    1.73   2.04   2.21   2.32   2.41   2.48   2.53   2.58   2.62
            .01    2.55   2.84   3.01   3.12   3.21   3.27   3.33   3.38   3.42
   19       .05    1.73   2.03   2.20   2.31   2.40   2.47   2.52   2.57   2.61
            .01    2.54   2.83   2.99   3.10   3.18   3.25   3.31   3.36   3.40
   20       .05    1.72   2.03   2.19   2.30   2.39   2.46   2.51   2.56   2.60
            .01    2.53   2.81   2.97   3.08   3.17   3.23   3.29   3.34   3.38
   24       .05    1.71   2.01   2.17   2.28   2.36   2.43   2.48   2.53   2.57
            .01    2.49   2.77   2.92   3.03   3.11   3.17   3.22   3.27   3.31
   30       .05    1.70   1.99   2.15   2.25   2.33   2.40   2.45   2.50   2.54
            .01    2.46   2.72   2.87   2.97   3.05   3.11   3.16   3.21   3.24
   40       .05    1.68   1.97   2.13   2.23   2.31   2.37   2.42   2.47   2.51
            .01    2.42   2.68   2.82   2.92   2.99   3.05   3.10   3.14   3.18
   60       .05    1.67   1.95   2.10   2.21   2.28   2.35   2.39   2.44   2.48
            .01    2.39   2.64   2.78   2.87   2.94   3.00   3.04   3.08   3.12
  120       .05    1.66   1.93   2.08   2.18   2.26   2.32   2.37   2.41   2.45
            .01    2.36   2.60   2.73   2.82   2.89   2.94   2.99   3.03   3.06
    ∞       .05    1.64   1.92   2.06   2.16   2.23   2.29   2.34   2.38   2.42
            .01    2.33   2.56   2.68   2.77   2.84   2.89   2.93   2.97   3.00


Table A15 Graphs of the Power Function for the Analysis of Variance 
(Fixed-Effects Model) 
Table A16 Table of Critical Values for Pearson r 


             One-tailed level of significance
              .05     .025     .01     .005

             Two-tailed level of significance
df = n-2      .10     .05      .02     .01
    1        .988    .997    .9995   .9999
    2        .900    .950    .980    .990
    3        .805    .878    .934    .959
    4        .729    .811    .882    .917
    5        .669    .754    .833    .874
    6        .622    .707    .789    .834
    7        .582    .666    .750    .798
    8        .549    .632    .716    .765
    9        .521    .602    .685    .735
   10        .497    .576    .658    .708
   11        .476    .553    .634    .684
   12        .458    .532    .612    .661
   13        .441    .514    .592    .641
   14        .426    .497    .574    .623
   15        .412    .482    .558    .606
   16        .400    .468    .542    .590
   17        .389    .456    .528    .575
   18        .378    .444    .516    .561
   19        .369    .433    .503    .549
   20        .360    .423    .492    .537
   21        .352    .413    .482    .526
   22        .344    .404    .472    .515
   23        .337    .396    .462    .505
   24        .330    .388    .453    .496
   25        .323    .381    .445    .487
   26        .317    .374    .437    .479
   27        .311    .367    .430    .471
   28        .306    .361    .423    .463
   29        .301    .355    .416    .456
   30        .296    .349    .409    .449
   35        .275    .325    .381    .418
   40        .257    .304    .358    .393
   45        .243    .288    .338    .372
   50        .231    .273    .322    .354
   60        .211    .250    .295    .325
   70        .195    .232    .274    .302
   80        .183    .217    .256    .283
   90        .173    .205    .242    .267
  100        .164    .195    .230    .254


Table A17 Table of Fisher's z_r Transformation 


   r    z_r      r    z_r      r    z_r      r    z_r      r    z_r
.000   .000   .200   .203   .400   .424   .600   .693   .800  1.099
.005   .005   .205   .208   .405   .430   .605   .701   .805  1.113
.010   .010   .210   .213   .410   .436   .610   .709   .810  1.127
.015   .015   .215   .218   .415   .442   .615   .717   .815  1.142
.020   .020   .220   .224   .420   .448   .620   .725   .820  1.157
.025   .025   .225   .229   .425   .454   .625   .733   .825  1.172
.030   .030   .230   .234   .430   .460   .630   .741   .830  1.188
.035   .035   .235   .239   .435   .466   .635   .750   .835  1.204
.040   .040   .240   .245   .440   .472   .640   .758   .840  1.221
.045   .045   .245   .250   .445   .478   .645   .767   .845  1.238
.050   .050   .250   .255   .450   .485   .650   .775   .850  1.256
.055   .055   .255   .261   .455   .491   .655   .784   .855  1.274
.060   .060   .260   .266   .460   .497   .660   .793   .860  1.293
.065   .065   .265   .271   .465   .504   .665   .802   .865  1.313
.070   .070   .270   .277   .470   .510   .670   .811   .870  1.333
.075   .075   .275   .282   .475   .517   .675   .820   .875  1.354
.080   .080   .280   .288   .480   .523   .680   .829   .880  1.376
.085   .085   .285   .293   .485   .530   .685   .838   .885  1.398
.090   .090   .290   .299   .490   .536   .690   .848   .890  1.422
.095   .095   .295   .304   .495   .543   .695   .858   .895  1.447
.100   .100   .300   .310   .500   .549   .700   .867   .900  1.472
.105   .105   .305   .315   .505   .556   .705   .877   .905  1.499
.110   .110   .310   .321   .510   .563   .710   .887   .910  1.528
.115   .116   .315   .326   .515   .570   .715   .897   .915  1.557
.120   .121   .320   .332   .520   .576   .720   .908   .920  1.589
.125   .126   .325   .337   .525   .583   .725   .918   .925  1.623
.130   .131   .330   .343   .530   .590   .730   .929   .930  1.658
.135   .136   .335   .348   .535   .597   .735   .940   .935  1.697
.140   .141   .340   .354   .540   .604   .740   .950   .940  1.738
.145   .146   .345   .360   .545   .611   .745   .962   .945  1.783
.150   .151   .350   .365   .550   .618   .750   .973   .950  1.832
.155   .156   .355   .371   .555   .626   .755   .984   .955  1.886
.160   .161   .360   .377   .560   .633   .760   .996   .960  1.946
.165   .167   .365   .383   .565   .640   .765  1.008   .965  2.014
.170   .172   .370   .388   .570   .648   .770  1.020   .970  2.092
.175   .177   .375   .394   .575   .655   .775  1.033   .975  2.185
.180   .182   .380   .400   .580   .662   .780  1.045   .980  2.298
.185   .187   .385   .406   .585   .670   .785  1.058   .985  2.443
.190   .192   .390   .412   .590   .678   .790  1.071   .990  2.647
.195   .198   .395   .418   .595   .685   .795  1.085   .995  2.994
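
For values of r that fall between the tabled entries, the transformation can be computed directly. Fisher's transformation is z_r = (1/2) ln[(1 + r)/(1 - r)], which is the inverse hyperbolic tangent of r. The short Python sketch here assumes only that definition; the function name `fisher_z` is illustrative.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher's z_r transformation of a correlation r, with -1 < r < 1."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    # Equivalent to 0.5 * math.log((1 + r) / (1 - r))
    return math.atanh(r)


for r in (0.500, 0.800, 0.995):
    print(f"r = {r:.3f}  z_r = {fisher_z(r):.3f}")
```

The transformation is odd, so for a negative correlation the tabled value for |r| applies with a negative sign.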
Table A18 Table of Critical Values for Spearman's Rho 


        One-tailed level of significance
         .05     .025     .01     .005

        Two-tailed level of significance
   n     .10     .05      .02     .01
   4   1.000      -        -       -
   5    .900   1.000    1.000      -
   6    .829    .886     .943   1.000
   7    .714    .786     .893    .929
   8    .643    .738     .833    .881
   9    .600    .700     .783    .833
  10    .564    .648     .745    .794
  11    .536    .618     .709    .755
  12    .503    .587     .671    .727
  13    .484    .560     .648    .703
  14    .464    .538     .622    .675
  15    .443    .521     .604    .654
  16    .429    .503     .582    .635
  17    .414    .485     .566    .615
  18    .401    .472     .550    .600
  19    .391    .460     .535    .584
  20    .380    .447     .520    .570
  21    .370    .435     .508    .556
  22    .361    .425     .496    .544
  23    .353    .415     .486    .532
  24    .344    .406     .476    .521
  25    .337    .398     .466    .511
  26    .331    .390     .457    .501
  27    .324    .382     .448    .491
  28    .317    .375     .440    .483
  29    .312    .368     .433    .475
  30    .306    .362     .425    .467
  35    .283    .335     .394    .433
  40    .264    .313     .368    .405
  45    .248    .294     .347    .382
  50    .235    .279     .329    .363
  60    .214    .255     .300    .331
  70    .198    .235     .278    .307
  80    .185    .220     .260    .287
  90    .174    .207     .245    .271
 100    .165    .197     .233    .257


Table A19 Table of Critical Values for Kendall's Tau 


Critical values for both τ and S are listed in the table. 


Two-tailed     .01          .02          .05          .10          .20
One-tailed     .005         .01          .025         .05          .10
   n          S     τ      S     τ      S     τ      S     τ      S     τ
   4          8  1.000     8  1.000     8  1.000     6  1.000     6  1.000
   5         12  1.000    10  1.000    10  1.000     8   .800     8   .800
   6         15  1.000    13   .867    13   .867    11   .733     9   .600
   7         19   .905    17   .810    15   .714    13   .619    11   .524
   8         22   .786    20   .714    18   .643    16   .571    12   .429
   9         26   .722    24   .667    20   .556    18   .500    14   .389
  10         29   .644    27   .600    23   .511    21   .467    17   .378
  11         33   .600    31   .564    27   .491    23   .418    19   .345
  12         38   .576    36   .545    30   .455    26   .394    20   .303
  13         44   .564    40   .513    34   .436    28   .359    24   .308
  14         47   .516    43   .473    37   .407    33   .363    25   .275
  15         53   .505    49   .467    41   .390    35   .333    29   .276
  16         58   .483    52   .433    46   .383    38   .317    30   .250
  17         64   .471    58   .426    50   .368    42   .309    34   .250
  18         69   .451    63   .412    53   .346    45   .294    37   .242
  19         75   .439    67   .392    57   .333    49   .287    39   .228
  20         80   .421    72   .379    62   .326    52   .274    42   .221
  21         86   .410    78   .371    66   .314    56   .267    44   .210
  22         91   .394    83   .359    71   .307    61   .264    47   .203
  23         99   .391    89   .352    75   .296    65   .257    51   .202
  24        104   .377    94   .341    80   .290    68   .246    54   .196
  25        110   .367   100   .333    86   .287    72   .240    58   .193
  26        117   .360   107   .329    91   .280    77   .237    61   .188
  27        125   .356   113   .322    95   .271    81   .231    63   .179
  28        130   .344   118   .312   100   .265    86   .228    68   .180
  29        138   .340   126   .310   106   .261    90   .222    70   .172
  30        145   .333   131   .301   111   .255    95   .218    75   .172
  31        151   .325   137   .295   117   .252    99   .213    77   .166
  32        160   .323   144   .290   122   .246   104   .210    82   .165
  33        166   .314   152   .288   128   .242   108   .205    86   .163
  34        175   .312   157   .280   133   .237   113   .201    89   .159
  35        181   .304   165   .277   139   .234   117   .197    93   .156
  36        190   .302   172   .273   146   .232   122   .194    96   .152
  37        198   .297   178   .267   152   .228   128   .192   100   .150
  38        205   .292   185   .263   157   .223   133   .189   105   .149
  39        213   .287   193   .260   163   .220   139   .188   109   .147
  40        222   .285   200   .256   170   .218   144   .185   112   .144
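
For n ≥ 6 each tabled τ equals the tabled S divided by the number of pairs, n(n - 1)/2 (for example, 19/21 = .905 for n = 7). The Python sketch here illustrates that relationship; it is an inference from the tabled values rather than a formula quoted from the Handbook, and the function name `tau_from_s` is illustrative.

```python
def tau_from_s(s: float, n: int) -> float:
    """Convert Kendall's S to tau by dividing by the number of pairs n(n - 1)/2."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return s / (n * (n - 1) / 2)


print(round(tau_from_s(19, 7), 3))
print(round(tau_from_s(222, 40), 3))
```

For n = 4 and n = 5 some tabled S values exceed n(n - 1)/2, and the table lists τ = 1.000 for those entries, so the conversion should not be applied to them.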
Table A20 Table of Critical Values for Kendall's Coefficient of Concordance 


Critical values for both S and W are listed. The values of W in the table were computed by substituting 
the tabled S values in Equation 31.3. 





Values at .05 level of significance 

          n = 3          n = 4          n = 5          n = 6          n = 7
  m      S     W       S     W       S     W       S     W       S      W
  3                                 64.4  .716   103.9  .660   157.3   .624
  4                   49.5  .619    88.4  .552   143.3  .512   217.0   .484
  5                   62.6  .501   112.3  .449   182.4  .417   276.2   .395
  6                   75.7  .421   136.1  .378   221.4  .351   335.2   .333
  8     48.1  .376   101.7  .318   183.7  .287   299.0  .267   453.1   .253
 10     60.0  .300   127.8  .256   231.2  .231   376.7  .215   571.0   .204
 15     89.8  .200   192.9  .171   349.8  .155   570.5  .145   864.9   .137
 20    119.7  .150   258.0  .129   468.5  .117   764.4  .109  1158.7   .103


Values at .01 level of significance 

          n = 3          n = 4          n = 5          n = 6          n = 7
  m      S     W       S     W       S     W       S     W       S      W
  3                                 75.6  .840   122.8  .780   185.6   .737
  4                   61.4  .768   109.3  .683   176.2  .629   265.0   .592
  5                   80.5  .644   142.8  .571   229.4  .524   343.8   .491
  6                   99.5  .553   176.1  .489   282.4  .448   422.6   .419
  8     66.8  .522   137.4  .429   242.7  .379   388.3  .347   579.9   .324
 10     85.1  .425   175.3  .351   309.1  .309   494.0  .282   737.0   .263
 15    131.0  .291   269.8  .240   475.2  .211   758.2  .193  1129.5   .179
 20    177.0  .221   364.2  .182   641.2  .160  1022.2  .146  1521.9   .136


Additional values for n = 3 

        At .05 level     At .01 level
  m      S      W         S      W
  9     54.0   .333      75.9   .469
 12     71.9   .250     103.5   .359
 14     83.8   .214     121.9   .311
 16     95.8   .187     140.2   .274
 18    107.7   .166     158.6   .245
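
The conversion from S to W can be sketched as follows. Equation 31.3 is not reproduced in this appendix, so the sketch assumes the standard definition W = 12S / [m²(n³ - n)], with m sets of ranks and n objects ranked; this reproduces the tabled values (for example, S = 64.4 with m = 3, n = 5 gives W = .716). The function name `kendall_w` is illustrative.

```python
def kendall_w(s: float, m: int, n: int) -> float:
    """Kendall's W from S, assuming W = 12*S / (m**2 * (n**3 - n)).

    m: number of sets of ranks (judges); n: number of objects ranked.
    """
    if m < 1 or n < 2:
        raise ValueError("need m >= 1 and n >= 2")
    return 12.0 * s / (m**2 * (n**3 - n))


print(round(kendall_w(64.4, 3, 5), 3))
print(round(kendall_w(48.1, 8, 3), 3))
```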
Table A21 Table of Critical Values for the Kolmogorov-Smirnov 
Goodness-of-Fit Test for a Single Sample 


One-tailed     .10       .05       .025      .01       .005
Two-tailed     .20       .10       .05       .02       .01
n = 1         .900      .950      .975      .990      .995
    2         .684      .776      .842      .900      .929
    3         .565      .636      .708      .785      .829
    4         .493      .565      .624      .689      .734
    5         .447      .509      .563      .627      .669
    6         .410      .468      .519      .577      .617
    7         .381      .436      .483      .538      .576
    8         .358      .410      .454      .507      .542
    9         .339      .387      .430      .480      .513
   10         .323      .369      .409      .457      .489
   11         .308      .352      .391      .437      .468
   12         .296      .338      .375      .419      .449
   13         .285      .325      .361      .404      .432
   14         .275      .314      .349      .390      .418
   15         .266      .304      .338      .377      .404
   16         .258      .295      .327      .366      .392
   17         .250      .286      .318      .355      .381
   18         .244      .279      .309      .346      .371
   19         .237      .271      .301      .337      .361
   20         .232      .265      .294      .329      .352
   21         .226      .259      .287      .321      .344
   22         .221      .253      .281      .314      .337
   23         .216      .247      .275      .307      .330
   24         .212      .242      .269      .301      .323
   25         .208      .238      .264      .295      .317
   26         .204      .233      .259      .290      .311
   27         .200      .229      .254      .284      .305
   28         .197      .225      .250      .279      .300
   29         .193      .221      .246      .275      .295
   30         .190      .218      .242      .270      .290
   31         .187      .214      .238      .266      .285
   32         .184      .211      .234      .262      .281
   33         .182      .208      .231      .258      .277
   34         .179      .205      .227      .254      .273
   35         .177      .202      .224      .251      .269
   36         .174      .199      .221      .247      .265
   37         .172      .196      .218      .244      .262
   38         .170      .194      .215      .241      .258
   39         .168      .191      .213      .238      .255
   40         .165      .189      .210      .235      .252
n > 40      1.07/√n   1.22/√n   1.36/√n   1.52/√n   1.63/√n
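
The large-sample row of Table A21 can be applied as in the following Python sketch. The coefficients are copied from the last row of the table, keyed by the two-tailed significance level; the function name `ks_large_n_critical` and the dictionary name are illustrative.

```python
import math

# Coefficients from the "n > 40" row, keyed by two-tailed significance level
KS_COEFFICIENTS = {0.20: 1.07, 0.10: 1.22, 0.05: 1.36, 0.02: 1.52, 0.01: 1.63}


def ks_large_n_critical(n: int, two_tailed_alpha: float) -> float:
    """Approximate single-sample Kolmogorov-Smirnov critical value for n > 40."""
    if n <= 40:
        raise ValueError("use the tabled exact value when n <= 40")
    return KS_COEFFICIENTS[two_tailed_alpha] / math.sqrt(n)


print(round(ks_large_n_critical(100, 0.05), 3))
```

For a one-tailed test, the same coefficient applies at half the two-tailed level (for example, 1.36 for one-tailed .025).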
Table A22 Table of Critical Values for the Lilliefors Test for Normality 


One-tailed     .20       .15       .10       .05       .01
Two-tailed     .40       .30       .20       .10       .02
n = 4         .300      .319      .352      .381      .417
    5         .285      .299      .315      .337      .405
    6         .265      .277      .294      .319      .364
    7         .247      .258      .276      .300      .348
    8         .233      .244      .261      .285      .331
    9         .223      .233      .249      .271      .311
   10         .215      .224      .239      .258      .294
   11         .206      .217      .230      .249      .284
   12         .199      .212      .223      .242      .275
   13         .190      .202      .214      .234      .268
   14         .183      .194      .207      .227      .261
   15         .177      .187      .201      .220      .257
   16         .173      .182      .195      .213      .250
   17         .169      .177      .189      .206      .245
   18         .166      .173      .184      .200      .239
   19         .163      .169      .179      .195      .235
   20         .160      .166      .174      .190      .231
   25         .142      .147      .158      .173      .200
   30         .131      .136      .144      .161      .187
n > 30      .736/√n   .768/√n   .805/√n   .886/√n  1.031/√n
Table A23 Table of Critical Values for the Kolmogorov-Smirnov Test 
for Two Independent Samples 


One-tailed 
Two-tailed 


.10 
.20 


.667 
-750 
.667 
.667 
.667 
.625 
.667 
.600 
.583 
-750 
.600 
.583 
.607 
.625 
556 
550 
583 
563 
.600 
.600 
571 
.550 
556 
.500 
.533 
.500 
.500 
.548 
.500 
.500 
.500 
.500 
444 
458 
571 
482 
492 
A471 
429 
429 
.500 
444 
475 
458 
438 
.406 


05 
.10 


.667 
150 
.800 
.667 
.714 
-750 
.667 
-700 
.667 
-750 
.750 
.667 
.714 
.625 
.667 
.650 
.667 
.625 
.600 
.667 
.657 
.625 
.600 
.600 
.600 
.550 
.667 
S71 
583 
556 
567 
583 
556 
.500 
571 
.589 
556 
557 
.500 
.464 
.500 
542 
525 
.500 
.500 
.438 


.025 


05 


.800 
.833 
.857 
-750 
T8 
.800 
150 
150 
.800 
-750 
750 
750 
-750 
-700 
.667 
.688 
.800 
.667 
.714 
.675 
.689 
-700 
.667 
.600 
.667 
.690 
.667 
.667 
.633 
.583 
611 
.583 
.714 
.625 
.635 
.614 
571 
.536 
.625 
.625 
575 
583 
563 
.500 


01 
02 


857 
875 
.889 
.900 
.833 


.800 
.833 
.857 
.875 
778 
.800 
150 
750 
.800 
.833 
.829 
.800 
.TI8 
-700 
.733 
-700 
.833 
.714 
-750 
722 
-700 
.667 
.667 
.625 
.714 
.732 
.714 
-700 
.643 
.607 
.625 
.667 
675 
.625 
.625 
.563 


.005 
01 


.889 
.900 
917 


.833 
.857 
.875 
.889 
.800 
.833 
.812 
.800 
.833 
.857 
.800 
.800 
.800 
93 
150 
.833 
.833 
.750 
778 
.733 
750 
722 
.667 
.714 
-750 
.746 
.714 
.714 
.643 
150 
150 
-700 
.667 
.625 
594 


Table A23 Table of Critical Values for the Kolmogorov-Smirnov Test 
for Two Independent Samples (continued) 





One-tailed      .10      .05      .025     .01      .005
Two-tailed      .20      .10      .05      .02      .01
 n₁   n₂
  9    9       .444     .556     .556     .667     .667
  9   10       .467     .500     .578     .667     .689
  9   12       .444     .500     .556     .611     .667
  9   15       .422     .489     .533     .600     .644
  9   18       .389     .444     .500     .556     .611
  9   36       .361     .417     .472     .528     .556
 10   10       .400     .500     .600     .600     .700
 10   15       .400     .467     .500     .567     .633
 10   20       .400     .450     .500     .550     .600
 10   40       .350     .400     .450     .500     .575
 11   11       .454     .454     .545     .636     .636
 12   12       .417     .417     .500     .583     .583
 12   15       .383     .450     .500     .550     .583
 12   16       .375     .438     .479     .542     .583
 12   18       .361     .417     .472     .528     .556
 12   20       .367     .417     .467     .517     .567
 13   13       .385     .462     .462     .538     .615
 14   14       .357     .429     .500     .500     .571
 15   15       .333     .400     .467     .467     .533
 16   16       .375     .375     .438     .500     .563
 17   17       .353     .412     .412     .471     .529
 18   18       .333     .389     .444     .500     .500
 19   19       .316     .368     .421     .473     .473
 20   20       .300     .350     .400     .450     .500
 21   21       .286     .333     .381     .429     .476
 22   22       .318     .364     .364     .454     .454
 23   23       .304     .348     .391     .435     .435
 24   24       .292     .333     .375     .417     .458
 25   25       .280     .320     .360     .400     .440
For all other sample sizes   1.07K    1.22K    1.36K    1.52K    1.63K

Where: K = √((n₁ + n₂) / (n₁n₂)) 
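
The rule for sample-size combinations not listed in Table A23 can be sketched in Python as follows. The definition of K is garbled in the scanned source; the sketch assumes the usual large-sample form K = √((n₁ + n₂)/(n₁n₂)). The coefficients are those in the last row of the table, keyed by the two-tailed level, and all names are illustrative.

```python
import math

# Coefficients from the "all other sample sizes" row, keyed by two-tailed level
KS2_COEFFICIENTS = {0.20: 1.07, 0.10: 1.22, 0.05: 1.36, 0.02: 1.52, 0.01: 1.63}


def ks_two_sample_critical(n1: int, n2: int, two_tailed_alpha: float) -> float:
    """Approximate two-sample Kolmogorov-Smirnov critical value, coefficient * K."""
    if n1 < 1 or n2 < 1:
        raise ValueError("sample sizes must be positive")
    k = math.sqrt((n1 + n2) / (n1 * n2))  # assumed definition of K
    return KS2_COEFFICIENTS[two_tailed_alpha] * k


print(round(ks_two_sample_critical(50, 50, 0.05), 3))
```

The tabled values for listed sample sizes should be preferred where available, since the coefficient-times-K rule is an approximation.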


© 2000 by Chapman & Hall/CRC 


